Abstract: For the additive white Gaussian noise channel with 
average codeword power constraint, new coding methods are 
devised in which the codewords are sparse superpositions, that 
is, linear combinations of subsets of vectors from a given design, 
with the possible messages indexed by the choice of subset. 
Decoding is by least squares, tailored to the assumed form of 
linear combination. Communication is shown to be reliable with 
error probability exponentially small for all rates up to the 
Shannon capacity. 



I. Introduction 

The additive white Gaussian noise channel is basic to Shan- 
non theory and underlies practical communication models. 
We introduce classes of superposition codes for this channel 
and analyze their properties. We link theory and practice by 
showing superposition codes from polynomial size dictionaries 
with least squares decoding achieve exponentially small error 
probability for any communication rate less than the Shannon 
capacity. A companion paper [7], [8] provides a fast decoding 
method and its analysis. The developments involve a merging 
of modern perspectives on statistical linear model selection 
and information theory. 

The familiar communication problem is as follows. An encoder
is required to map input bit strings $u = (u_1, u_2, \ldots, u_K)$
of length K into codewords which are length n strings of real
numbers $c_1, c_2, \ldots, c_n$, with norm expressed via the power
$(1/n)\sum_{i=1}^{n} c_i^2$. We constrain the average of the power
across the $2^K$ codewords to be not more than P. The channel adds
independent $N(0, \sigma^2)$ noise to the selected codeword yielding
a received length n string Y. A decoder is required to map it
into an estimate $\hat{u}$ which we want to be a correct decoding of
u. Block error is the event $\hat{u} \neq u$, bit error at position i
is the event $\hat{u}_i \neq u_i$, and the bit error rate is
$(1/K)\sum_{i=1}^{K} 1_{\{\hat{u}_i \neq u_i\}}$.
An analogous section error rate for our code is defined below.
The reliability requirement is that, with sufficiently large n,
the bit error rate or section error rate is small with high
probability or, more stringently, the block error probability is
small, averaged over input strings u as well as the distribution
of Y. The communication rate R = K/n is the ratio of the
input length to the codelength for communication across the
channel.

The supremum of reliable rates is the channel capacity
$C = (1/2)\log_2(1 + P/\sigma^2)$, by traditional information theory as
in [46], [29], [19]. Standard communication models, even in
continuous-time, have been reduced to the above discrete-time
white Gaussian noise setting, as in [29], [26]. This problem
is also of interest in mathematics because of relationship 
to versions of the sphere packing problem as described in 
Conway and Sloane fTE\. For practical coding the challenge 
is to achieve rates arbitrarily close to capacity with a codebook 
of moderate size, while guaranteeing reliable decoding in 
manageable computation time. 

We introduce a new coding scheme based on sparse super- 
positions with a moderate size dictionary and analyze its per- 
formance. Least squares is the optimal decoder. Accordingly, 
we analyze the reliability of least squares and approximate 
least squares decoders. The analysis here is without concern 
for computational feasibility. In similar settings computational 
feasibility is addressed in the companion paper [7], [8], though 
the closeness to capacity at given reliability levels is not as 
good as developed here. 

We introduce sparse superposition codes and discuss the
reliability of least squares in Subsection I-A of this Introduction.
Subsection I-B contrasts the performance of least squares with
what is achieved by other methods of decoding. In Subsection
I-C we mention relations with work on sparse signal recovery
in the high dimensional regression setting. Subsection I-D
discusses other codes and Subsection I-E discusses some
important forerunners to our developments here. Our reliability
bounds are developed in subsequent sections.

A. Sparse Superposition Codes 

We develop the framework for code construction by linear
combinations. The story begins with a list (or book)
$X_1, X_2, \ldots, X_N$ of vectors, each with n coordinates, for
which the codeword vectors take the form of superpositions
$\beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_N X_N$. The vectors $X_j$
which are linearly combined provide the terms or components of the
codewords and the $\beta_j$ are the coefficients. The received vector
is in accordance with the statistical linear model

$$Y = X\beta + \varepsilon$$

where X is the matrix whose columns are the vectors
$X_1, X_2, \ldots, X_N$ and $\varepsilon$ is the noise vector distributed
Normal$(0, \sigma^2 I)$. In keeping with the terminology of that
statistical setting, the book X may be called the design matrix
consisting of p = N variables, each with n observations, and
this list of variables is also called the dictionary of candidate
terms.

The coefficient vectors $\beta$ are arranged to be of a specified
form. For subset superposition coding we arrange for a number
L of the coordinates to be non-zero, with a specified positive
value, and the message is conveyed by the choice of subset.
Denote B = N/L. If B is large, it is a sparse superposition
code. In this case, the number of terms sent is a small fraction
of dictionary size. With somewhat greater freedom, one may
arrange the non-zero coefficients to be +1 or -1 times a
specified value, in which case the superposition code is said
to be signed. Then the message is conveyed by the sequence
of signs as well as the choice of subset.

To allow such forms of $\beta$, we do not in general take the set 
of permitted coefficient vectors to be closed under a field of 
linear operations, and hence our linear statistical model does 
not correspond to a linear code in the sense of traditional 
algebraic coding theory. 

In a specialization we call a partitioned superposition code, 
the book X is split into L sections of size B, with one term 
selected from each, yielding L terms in each codeword out of 
a dictionary of size N = LB. Likewise, the coefficient vector 
$\beta$ is split into sections, with one coordinate non-zero in each 
section to indicate the selected term. Optionally, we have the 
additional freedom of choice of sign of this coefficient, for a 
signed partitioned code. It is desirable that the section sizes 
be not larger than a moderate order polynomial in L or n, for 
then the dictionary is arranged to be of manageable size. 

Most convenient is the case that the sizes of these sections 
are powers of two. Then an input bit string of length
$K = L \log_2 B$ splits into L substrings of size $\log_2 B$. The encoder
mapping from u to $\beta$ is then obtained by interpreting each
substring of u as simply giving the index of which coordinate
of $\beta$ is non-zero in the corresponding section. That is, each
substring is the binary representation of the corresponding 
index. 
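
The encoder mapping just described can be sketched in a few lines of plain Python. This is only an illustration under the stated assumption that B is a power of two; the function name `bits_to_beta` and the list representation of the coefficient vector are hypothetical choices, not taken from the paper.

```python
def bits_to_beta(bits, L, B):
    """Map K = L*log2(B) input bits to a 0/1 coefficient vector of length N = L*B.

    Each block of log2(B) bits is read as the binary index of the single
    non-zero coordinate in the corresponding section.
    """
    bits_per_section = B.bit_length() - 1
    if B < 1 or 2 ** bits_per_section != B:
        raise ValueError("B must be a power of two")
    if len(bits) != L * bits_per_section:
        raise ValueError("expected L * log2(B) input bits")
    beta = [0] * (L * B)
    for section in range(L):
        chunk = bits[section * bits_per_section:(section + 1) * bits_per_section]
        index = 0
        for bit in chunk:  # binary representation, most significant bit first
            index = 2 * index + bit
        beta[section * B + index] = 1
    return beta


# Example: L = 3 sections of size B = 4, so K = 6 input bits.
beta = bits_to_beta([1, 0, 0, 1, 1, 1], L=3, B=4)
print(beta)
```

In the example the three substrings 10, 01 and 11 select coordinates 2, 1 and 3 of their respective sections.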

As we have said, the rate of the code is R = K/n
input bits per channel use and we arrange for R arbitrarily
close to C. For the partitioned superposition code, this rate
is $R = (L \log B)/n$. For specified rate R, the codelength is
$n = (L/R) \log B$. Thus, the length n and the number of terms
L agree to within a log factor.
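
As a small numerical illustration of the relation between rate, number of sections and codelength, the following Python sketch evaluates n = (L/R) log B with logarithms in base 2, so that R is in bits per channel use. The function names and the example values L = 16, B = 64, R = 1.5 are assumptions made only for this illustration.

```python
from math import ceil, log2


def codelength(L, B, R):
    """Smallest integer n with (L * log2(B)) / n <= R, for rate R in bits per use."""
    return ceil(L * log2(B) / R)


def rate_in_bits(L, B, n):
    """Rate R = (L * log2(B)) / n of a partitioned superposition code."""
    return L * log2(B) / n


L, B = 16, 64                      # K = L * log2(B) = 96 input bits
n = codelength(L, B, R=1.5)        # 96 / 1.5 = 64 channel uses
print(n, rate_in_bits(L, B, n))
```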

With one term from each section, the number of possible
codewords $2^K$ is equal to $B^L = (N/L)^L$. Alternatively, if
we allow for all subsets of size L, the number of possible
codewords would be $\binom{N}{L}$, which is of order
$(Ne/L)^L = (Be)^L$, for L small compared to N. To match the number
of codewords, it would correspond to reducing N by a factor
of 1/e. Though there would be the factor 1/e savings in 
dictionary size from allowing all subsets of the specified size, 
the additional simplicity of implementation and simplicity of 
analysis with partitioned coding is such that we take advantage 
of it wherever appropriate. 
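
The comparison between the two codebook sizes can be checked numerically. The sketch below, with illustrative values L = 16 and B = 64 chosen only for this example, computes the number of message bits for the partitioned code and for all subsets of size L, together with the bound $L \log_2(Be)$.

```python
from math import comb, e, log2


def bits_partitioned(L, B):
    """log2 of the number of codewords B**L with one term per section."""
    return L * log2(B)


def bits_all_subsets(L, B):
    """log2 of the number of subsets of size L from a dictionary of size N = L*B."""
    return log2(comb(L * B, L))


L, B = 16, 64
partitioned = bits_partitioned(L, B)
all_subsets = bits_all_subsets(L, B)
upper = L * log2(B * e)            # log2 of (Be)**L
print(partitioned, all_subsets, upper)
```

The middle quantity lies between the other two, which is the sense in which allowing all subsets gains at most a factor of e in effective section size.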

With signed partitioned coding the story is similar, now with
$(2B)^L = (2N/L)^L$ possible codewords using the dictionary of
size N = LB. The input string of length
$K = L \log_2(2B) = L(1 + \log_2 B)$ splits into L sections with
$\log_2 B$ bits to specify the non-zero term and 1 bit to specify
its sign. For a rate R code this entails a codelength of
$n = (L/R) \log(2B)$.

Control of the dictionary size is critical to computationally
advantageous coding and decoding. Possible dictionary sizes
are between the extremes K and $2^K$ dictated by the number
and size of the sections, where K is the number of input bits.
At one extreme, with 1 section of size $B = 2^K$, one has X
as the whole codebook with its columns as the codewords,
but the exponential size makes its direct use impractical. At
the other extreme we have L = K sections, each with two
candidate terms in subset coding or two signs of a single term
in sign coding with B = 1; in which case X is the generator
matrix of a linear code.

Between these extremes, we construct reliable, high-rate 
codes with codewords corresponding to linear combinations 
of subsets of terms in moderate size dictionaries. 

Design of the dictionary is guided by what is known from 
information theory concerning the distribution of symbols in 
the codewords. By analysis of the converse to the channel 
coding theorem (as in [19]), for a reliable code at rate near 
capacity, with a uniform distribution on the sequence of input 
bits, the induced empirical distribution on coordinates of the 
codeword must be close to independent Gaussian, in the sense 
that the resulting mutual information must be close to its 
maximum subject to the power constraint. 

We draw entries of X independently from a normal distri- 
bution with mean zero and a variance we specify, yielding the 
properties we want with high probability. Other distributions, 
such as independent equiprobable ±1, might also suffice, with 
a near Gaussian shape for the codeword distribution obtained 
by the convolutions associated with sums of terms in subsets 
of size L. 

For the vectors $\beta$, the non-zero coefficients may be assigned
to have magnitude $\sqrt{P/L}$, which with X having independent
entries of variance 1, yields codewords $X\beta$ of average power
near P. There is a freedom of scale that allows us to simplify
the coefficient representation. Henceforth, we arrange the
coordinates of $X_j$ to have variance P/L and set the non-zero
coefficients to have magnitude 1.
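
A minimal sketch of this construction, using only the Python standard library, draws a dictionary with independent N(0, P/L) entries and checks that a codeword formed from one column per section has average power near P. The sizes n = 2000, L = 8, B = 4 and the helper names are assumptions chosen for illustration.

```python
import random


def gaussian_dictionary(n, L, B, P, rng):
    """n x (L*B) matrix (list of rows) with independent N(0, P/L) entries."""
    sd = (P / L) ** 0.5
    return [[rng.gauss(0.0, sd) for _ in range(L * B)] for _ in range(n)]


def codeword(X, selected):
    """Sum of the selected columns (all non-zero coefficients equal to 1)."""
    return [sum(row[j] for j in selected) for row in X]


rng = random.Random(0)
n, L, B, P = 2000, 8, 4, 1.0
X = gaussian_dictionary(n, L, B, P, rng)
# One term from each section: column section*B + (index within section).
selected = [section * B + rng.randrange(B) for section in range(L)]
c = codeword(X, selected)
power = sum(x * x for x in c) / n   # average power (1/n) * sum of c_i^2
print(power)
```

Each coordinate of the codeword is a sum of L independent N(0, P/L) variables, hence N(0, P), so the empirical power concentrates near P as n grows.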

Optimal decoding for minimal average probability of error
consists of finding the codeword $X\beta$ with coefficient vector $\beta$
of the assumed form that maximizes the posterior probability,
conditioning on X and Y. This coincides, in the case of
equal prior probabilities, with the maximum likelihood rule
of seeking such a codeword to minimize the sum of squared
errors in fit to Y. This is a least squares regression problem
$\min_\beta \|Y - X\beta\|^2$, with constraints on the coefficient vector.
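
For very small L and B the constrained least squares problem can be solved by exhaustive search over the $B^L$ allowed subsets. The following Python sketch does exactly that; it is meant only to make the decoding rule concrete, is exponential in L, and uses illustrative parameter values and names that are not from the paper.

```python
import itertools
import random


def least_squares_decode(X, Y, L, B):
    """Exhaustive search over the B**L partitioned subsets (tiny L and B only)."""
    n = len(Y)
    best, best_err = None, float("inf")
    for choice in itertools.product(range(B), repeat=L):
        cols = [s * B + j for s, j in enumerate(choice)]
        # Sum of squared errors ||Y - X beta||^2 for this candidate codeword.
        err = sum((Y[i] - sum(X[i][c] for c in cols)) ** 2 for i in range(n))
        if err < best_err:
            best, best_err = choice, err
    return best


rng = random.Random(1)
n, L, B, P, sigma = 12, 3, 4, 1.0, 0.05
sd = (P / L) ** 0.5
X = [[rng.gauss(0.0, sd) for _ in range(L * B)] for _ in range(n)]
sent = tuple(rng.randrange(B) for _ in range(L))      # index chosen in each section
Y = [sum(X[i][s * B + j] for s, j in enumerate(sent)) + rng.gauss(0.0, sigma)
     for i in range(n)]
decoded = least_squares_decode(X, Y, L, B)
print(sent, decoded)
```

With the low noise level used here the rate is far below capacity, so the decoded tuple agrees with the one that was sent.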

We show for all R < C, that the least squares solution, as
well as approximate least squares solutions such as may arise
computationally, will have, with high probability, at most a
negligible fraction of terms that are not correctly identified,
producing a low bit error rate. The heart of the analysis shows
that competing codewords that differ in a fraction of at least
$\alpha_0$ terms are exponentially unlikely to have smaller distance
from Y than the true codeword, provided that the section size
$B = L^a$ is polynomially large in the number of sections L,
where a sufficient value of a is determined. For the partitioned
superposition code there is a positive constant c such that for
rates R less than the capacity C, with a positive gap $\Delta = C - R$
not too large, the probability of a fraction of mistakes at least
$\alpha_0$ is not more than

$$\exp\{-n c \min\{\Delta^2, \alpha_0\}\}.$$

Consequently, for a target fraction of mistakes $\alpha_0$ and target
probability $\epsilon$, the required number of sections L or equivalently
the codelength $n = (a L \log L)/R$ depends only polynomially
on the reciprocal of the gap $\Delta$ and on the reciprocal of $\alpha_0$.
Indeed n of order $[(1/\alpha_0) + (1/\Delta)^2] \log(1/\epsilon)$ suffices for the
probability of the undesirable event to be less than $\epsilon$.

Fig. 1. Plot of comparison between achievable rates using our scheme and the theoretical best possible rates for block error probability of 10~* and
signal-to-noise ratio (v) values of 20 and 100. The curves for our partitioned superposition code were evaluated at points with number of sections L ranging
from 20 to 100 in steps of 10, with corresponding B values taken to be $L^{a_v}$, where $a_v$ is as given in Lemma 4 later on. For the v values of 20 and 100
shown above, $a_v$ is around 2.6 and 1.6, respectively.

Moreover, an approach is discussed which completes the 
task of identifying the terms by arranging sufficient distance 
between the subsets, using composition with an outer Reed- 
Solomon (RS) code of rate near one. The Reed-Solomon code 
is arranged to have an alphabet of size B equal to a power 
of 2. It is tailored to the partitioned code by having the RS 
code symbols specify the terms selected from the sections. 
The outer RS code corrects the small fraction of remaining 
mistakes so that we end up not only with small section error 
rate but also with small block error probability. If
$R_{outer} = 1 - \delta$ is the rate of an RS code, with $0 < \delta < 1$, then section
error rate less than $\alpha_0$ can be corrected, provided $2\alpha_0 < \delta$.
Further, if $R_{inner}$ (or simply R) is the rate associated with our
inner (superposition) code, then the total rate after correcting
for the remaining mistakes is given by $R_{total} = R_{inner} R_{outer}$.
The end result, using our theory for the distribution of the 
fraction of mistakes of the superposition code, is that the block 
error probability is exponentially small. One may regard the 
composite code as a superposition code in which the subsets 
are forced to maintain at least a certain minimal separation, so 
that decoding to within a certain distance from the true subset 
implies exact decoding. 
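
The rate bookkeeping for the composite code is simple enough to state as a short Python sketch. The function name `composite_rate` and the numerical values are hypothetical; the only facts used are the condition $2\alpha_0 < \delta$ and the product form of the total rate given above.

```python
def composite_rate(inner_rate, delta, alpha0):
    """Total rate R_inner * (1 - delta) for an outer code of rate 1 - delta.

    Raises ValueError unless 0 < delta < 1 and 2 * alpha0 < delta, the condition
    under which a section error rate below alpha0 can be corrected.
    """
    if not 0 < delta < 1:
        raise ValueError("delta must lie strictly between 0 and 1")
    if not 2 * alpha0 < delta:
        raise ValueError("need 2 * alpha0 < delta")
    return inner_rate * (1 - delta)


# Example: inner rate 1.5, tolerated section error rate 0.02, outer rate 0.95.
total = composite_rate(1.5, delta=0.05, alpha0=0.02)
print(total)
```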

Particular interest is given to the case that the rate R is
made to approach the capacity C. Arrange $R = C - \Delta_n$ and
$\alpha_0 = \Delta_n^2$. If the rate gap $\Delta_n$ tends to zero (e.g. at
a $1/\log n$ rate or any polynomial rate not faster than $1/\sqrt{n}$),
then the overall rate $R_{tot} = (1 - 2\alpha_0)(C - \Delta_n)$ continues
to have drop from capacity of order $\Delta_n$, with the composite
code having block error probability of order

$$\exp\{-n c \Delta_n^2\}.$$

The exponent above, of order $(C - R)^2$ for R near C, is in
agreement with the form of the optimal reliability bounds as
in [28], [40], though here our constant c is not demonstrated
to be optimal.

In Figure 1 we plot curves of achievable rates using our 
scheme for block error probability fixed at 10^^ and signal to 
noise ratios of 20 and 100. We also compare this to a rate curve 
given in Polyanskiy, Poor and Verdú [40] (the PPV curve),
where it is demonstrated that for a Gaussian channel with
signal to noise ratio v, the block error probability $\epsilon$, codelength
n and rate R with an optimal code can be well approximated
by the following relation,

$$R \approx C - \sqrt{\frac{V}{n}}\, Q^{-1}(\epsilon) + \frac{1}{2} \frac{\log n}{n},$$

where $V = (v/2)(v + 2) \log^2 e/(v + 1)^2$ is the channel
dispersion and Q is the complementary Gaussian cumulative
distribution function.
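
The normal approximation above is easy to evaluate numerically. The Python sketch below works in nats (so that $\log e = 1$ in the dispersion) and uses `statistics.NormalDist` from the standard library for $Q^{-1}$; the function names and the example values n = 1000, 2000, v = 20 and $\epsilon = 10^{-4}$ are assumptions for illustration and are not the settings used for the figure.

```python
from math import log, sqrt
from statistics import NormalDist


def capacity_nats(snr):
    """Capacity C = (1/2) log(1 + v) in nats per channel use."""
    return 0.5 * log(1 + snr)


def ppv_rate_nats(n, snr, eps):
    """Normal approximation C - sqrt(V/n) * Qinv(eps) + (1/2) * log(n) / n, in nats."""
    dispersion = (snr / 2) * (snr + 2) / (snr + 1) ** 2
    q_inv = NormalDist().inv_cdf(1 - eps)   # Q^{-1}(eps) for the standard normal
    return capacity_nats(snr) - sqrt(dispersion / n) * q_inv + 0.5 * log(n) / n


r_1000 = ppv_rate_nats(1000, 20.0, 1e-4)
r_2000 = ppv_rate_nats(2000, 20.0, 1e-4)
print(r_1000, r_2000, capacity_nats(20.0))
```

The approximate rate increases with the codelength and stays below capacity for these values, reflecting the back-off of order $1/\sqrt{n}$.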

For the superposition code curve, the y-axis gives the 
highest Rcomp for which the error probability stays below 
10~^. These curves are based on the minimum of the bounds
obtained by our lemma in Section III. We see for the given
v and block error probability values, the achievable rates
using our scheme are reasonably close to the theoretically 
best scheme. Note that the PPV curve was computed with 
an approach that uses a codebook of size that is exponential 
in blocklength, whereas our dictionary, of size LB, is of 
considerably smaller size.

B. Contrasting Methods of Decoding

As we have said the least squares decoder minimizes
$\|Y - X\beta\|^2$ with constraint on the form of coefficient vector $\beta$. It
is unknown whether approximate least squares decoding with
rate R near the capacity C is practical in the equal power case
studied here. Alternative methods include an iterative decoder
that we discuss briefly here and convex optimization methods
discussed here and in Subsection I-C.

The practical iterative decoder, for the partitioned
superposition code, proposed and analyzed in [7], [8] is called an
adaptive successive decoder. Decoding is broken into multiple
steps, with the identification of terms in a step achieved when
the magnitude of the inner product between the corresponding
$X_j$'s and a computed residual vector is above a specified
threshold. The residual vector for each step is obtained as
the difference of Y and the contribution from columns decoded
in previous steps.

With a rate that is of order 1 / log B below capacity, the error 
probability attained there is exponentially small in L/ (log B)^, 
to within a log log B factor. This error exponent is slightly
smaller than the optimal $n/(\log B)^2$, obtained here by the least
squares scheme. Moreover, as we saw above, the least squares
decoder achieves the optimal exponent for other orders $\Delta_n$ of
drop from capacity.

The sparse superposition codes achieving these performance 
levels at rates near capacity, by least squares and by adaptive 
successive decoding are different in an important aspect. For 
the present paper, we use a constant power allocation, with the 
same power P/L for each term. However in [7], to yield rates
near capacity we needed a variable power allocation, achieved
by a specific schedule of the non-zero $\beta_j$'s. In contrast, if one
were to use equal power allocation for the decoding scheme
in [7], then reliable decoding holds only up to a threshold rate
$R_0 = (1/2)P/(P + \sigma^2)$, which is less than the capacity C,
with the rate and capacity expressed in nats.

The least squares optimization $\min \|Y - X\beta\|^2$ is made
challenging by the non-convex constraint that there be a
specified number of non-zero coefficients, one in each section.
Nevertheless, one can consider decoders based on projection
to the convex hull. This convex hull consists of the $\beta$ vectors
which have sum in each section equal to 1. (With signed
coding it becomes the constraint that the $\ell_1$ norm in each section
is bounded by 1.) Geometrically, it provides a convex set of
linear combinations in which the codewords are the vertices.
Decoding is completed with convex projection by moving to
a vertex, e.g. with the largest coefficient value in each section.
This is a setting in which we initiated investigations. However,
in that preliminary analysis, we found that such constrained
quadratic optimization allows for successful decoding only for
rates up to $R_{thres}$ for the equal power case. It is as yet unclear
what its reliability properties would be at rates up to capacity
C with variable power.

C. Related Work on Sparse Signal Recovery 

The conclusions regarding communication rate may be also
expressed in the language of sparse signal recovery and
compressed sensing. A number of terms selected from a dictionary
are linearly combined and subject to noise in accordance with
the linear model framework $Y = X\beta + \varepsilon$. Let N be the
number of variables and L the number of non-zero terms.
An issue dealt with by these fields is the minimal number
of observations n sufficient to reliably recover the terms. In
our setting, the non-zero values of the coefficients are known
and n satisfies the relationship $n = (1/R) \log \binom{N}{L}$ for general
subsets and $n = (1/R) L \log(N/L)$ for the partitioned case.
We show that reliable recovery is possible provided R < C.

The conclusions here complement recent work on sparse
signal recovery [14], [20], [22] in the sparse noise case and
[48], [50], [24], [47], [13] in the Gaussian noise case.
Connections between signal recovery and channel coding are also
highlighted in pT^. A hallmark of work in signal recovery is 
allowance for greater generality of signal coefficient values. In
the regime as treated here, where $N \gg L$ and where there is
a control on the sum of squares of the coefficients as well as
a control on the minimum coefficient value, conclusions from
this literature take the form that the best n is of the order
$L \log(N/L)$, with upper and lower bounds on the constants
derived. It is natural to call (the reciprocal of) the best constant, 
for a given set of allowed signals and given noise distribution, 
the compressed sensing capacity or signal recovery capacity. 

For the converse results in [48], [50], Fano's inequality is
used to establish constants related to the channel capacity.
Refinements of this work can be found in [24]. Convex
projection methods with $\ell_1$ constraints as in [49], [47], [13]
have been used for achievability results. The same order of
performance is achieved by a maximum correlation estimator
[24]. Analysis of constants achieved by least squares is in
BSll . ||23| . The above analysis, when interpreted in our setting, 
corresponds to saying that these schemes have communication 
rate that is positive, though at least a fixed amount below the 
channel capacity. For our setting, a consequence of the result 
here is that the signal recovery capacity is equal to the channel 
capacity. 

D. Related Communication Issues and Schemes 

The development here is specific to the discrete-time channel
for which $Y_i = c_i + \varepsilon_i$ for $i = 1, 2, \ldots, n$ with
real-valued inputs and outputs and with independent Gaussian
noise. Standard communication models, even in continuous- 
time, have been reduced to this discrete-time white Gaussian 
noise setting, or to parallel uses of such, when there is a 
frequency band constraint for signal modulation and when 
there is a specified spectrum of noise over that frequency band, 
as in 1291, lH. 

Standard approaches, as discussed in [26], entail a
decomposition of the problem into separate problems of coding and
of shaping of a multivariate signal constellation. For the low 
signal-to-noise regime, binary codes suffice for communication 
near capacity and there is no need for shaping. There is prior 
work concerning reliable communications near capacity for 
certain discrete input channels. Iterative decoding algorithms 
based on statistical belief propagation in loopy networks have 
been empirically shown in various works to provide reliable 
and moderately fast decoding at rates near the capacity for
such channels, and mathematically proven to provide such
properties in certain special cases, such as the binary erasure
channel in 1371 . ll3Fl . These include codes based on low 
density parity check codes [28] and turbo codes [11], [12].
See [43], [44] for some aspects of the state of the art with
such techniques. 

A different approach to reliable and computationally feasible
decoding to achieve the rates possible with restriction to
discrete alphabet signaling, is in the work on channel polariza- 
tion of Arikan and Telatar lO], H|. They achieve rates up to the 
mutual information I between a uniform input distribution and
the output of the channel. Error probability is demonstrated
there at a level exponentially small in $n^{1/2}$ for any fixed R < I.
In contrast for our codes the error probability is exponentially
small in $n(C - R)^2$ for the least squares decoder and within a
log factor of being exponentially small in n for the practical
decoder in [7], [8]. Moreover, communication is permitted
at higher rates beyond that associated with a uniform input 
distribution. We are aware from personal conversation with 
Imre Telatar and Emanuel Abbe that they are investigating 
the extent to which channel polarization can be adapted to 
Gaussian signaling. 

In the high signal-to-noise regime, one needs a greater signal 
alphabet size. As explained in [26], along with coding schemes 
on such alphabets, additional shaping is required in order to 
be able to achieve rates up to capacity. Here shaping refers to 
making the codeword vectors approximate a good packing of 
points on the n dimensional sphere of square radius dictated 
by the power. An implication is that, marginally and jointly 
for any subset of codeword coordinates, the set of codewords 
should have empirical distribution not far from Gaussian. 
Notice that we build shaping directly into the coding scheme 
by the superposition strategy yielding codewords following a 
Gaussian distribution. 

Our ideas of sparse superposition coding are adapted to
Gaussian vector quantization in Kontoyiannis, Gitzenis and
Rad [34]. Applicability to vector quantization is natural
because of the above-mentioned connection between packing and
coding.

E. Precursors 

The analysis of concatenated codes in Forney [25] is an 
important forerunner to the development we give here. He 
identified benefits of an outer Reed-Solomon code paired 
in theory with an optimal inner code of Shannon-Gallager 
type and in practice with binary inner codes based on linear 
combinations of orthogonal terms (for target rates K/n less 
than 1 such a basis is available). The challenge concerning 
theoretically good inner codes is that the number of messages 
searched is exponentially large in the inner codelength. Forney 
made the inner codelength of logarithmic size compared to the 
outer codelength as a step toward practical solution. However, 
caution is required with such a strategy. Suppose the rate of the 
inner code has only a small drop from capacity, $\Delta = C - R$.
For small inner code error probability, the inner codelength
must be of order at least $1/\Delta^2$. So with that scheme one has
the undesirable consequence that the required outer codelength
becomes exponential in $1/\Delta^2$.



For the Gaussian noise channel, our tactic to overcome that 
difficulty uses a superposition inner code with a polynomial 
size dictionary. We use inner and outer codelengths that are 
comparable, with the outer code used to correct errors in a 
small fraction of the sections of the inner code. The overall 
codelength to achieve error probability $\epsilon$ remains of the order
$(1/\Delta^2) \log(1/\epsilon)$.

Another point of relationship of this work with other ideas 
is the problem of multiple comparisons in hypothesis tests. 
False discovery rate [10] for a given significance level, rather 
than exclusively overall error probability is a recent focus 
in statistical development, appropriate when considering very 
large numbers of hypotheses as arise with many variables in 
regression. Our theory for the distribution of the fraction of 
incorrectly determined terms (associated with bit error rate 
rather than block error rate) provides an additional glimpse of 
what is possible in a regression setting with a large number 
of subset hypotheses. The work of [32] is a recent example 
where subset selection within groups (sections) of variables is 
addressed by extension of false discovery methods. 

The idea of superposition coding for Gaussian noise chan- 
nels began with Cover ifTSll in the context of multiple-user 
channels. In that setting what is sent is a sum of codewords, 
one for each message. Here we are putting that idea to use for 
the original Shannon single-user problem. The purpose here of 
computational feasibility is different from the original multi- 
user purpose which was identification of the set of achievable 
rates. Another connection with that broadcast channel work by 
Cover is that for such Gaussian channels, the power allocation 
can be arranged such that messages can be peeled off one 
at a time by successive decoding. Related rate splitting and 
successive decoding for superposition codes are developed for 
Gaussian multiple-access problems in ITSl and l45l . where 
in some cases to establish such reductions, rate splitting is 
applied to individual users. However, feasibility has been lack- 
ing in part due to the absence of demonstration of reliability 
at high rate with superpositions from polynomial size code 
designs. 

It is an attractive feature of our solution for the single-user 
channel that it should be amenable to extension to practical 
solution of the corresponding multi-user channels, namely, the 
Gaussian multiple access and Gaussian broadcast channel. 

Section II contains brief preliminaries. Section III provides
core lemmas on the reliability of least squares for our
superposition codes. Section IV analyzes the matter of section size
sufficient for reliability. Section V confirms that the probability
of more than a small fraction of mistakes is exponentially
small. Section VI discusses properties of the composition
of our code with a binary outer code for correction of any
remaining small fraction of mistakes. The appendix collects
some auxiliary matters.

II. Preliminaries 

For vectors a, b of length n, let $\|a\|^2$ be the sum of squares
of coordinates, let $|a|^2 = (1/n) \sum_{i=1}^{n} a_i^2$ be the average square
and let $a \cdot b = (1/n) \sum_{i=1}^{n} a_i b_i$ be the associated inner product.
It is a matter of taste, but we find it slightly more convenient
to work henceforth with the norm $|a|$ rather than $\|a\|$.

Concerning the base of the logarithm (log) and associated
exponential (exp), base 2 is most suitable for interpretation
and base e most suitable for the calculus. For instance, the
rate $R = (L \log B)/n$ is measured in bits if the log is base 2
and nats if the log is base e. Typically, conclusions are stated
in a manner that can be interpreted to be invariant to the choice
of base, and base e is used for convenience in the derivations.

We make repeated use of the following moment generating
function and its associated large deviation exponent in
constructing bounds on error probabilities. If $Z$ and $\tilde{Z}$ are normal
with means equal to 0, variances equal to 1, and correlation
coefficient $\rho$ then $E(e^{(\lambda/2)(Z^2 - \tilde{Z}^2)})$ takes the value

$$\frac{1}{[1 - \lambda^2 (1 - \rho^2)]^{1/2}}$$

when $\lambda^2 < 1/(1 - \rho^2)$ and infinity otherwise. So the
associated cumulant generating function of $(1/2)(Z^2 - \tilde{Z}^2)$ is
$-(1/2) \log(1 - \lambda^2 (1 - \rho^2))$, with the understanding that the
minus log is replaced by infinity when $\lambda^2$ is at least $1/(1 - \rho^2)$.
For positive $\Delta$ we define the quantity $D = D(\Delta, 1 - \rho^2)$ given
by

$$D = \max_{\lambda \ge 0} \{\lambda \Delta + (1/2) \log(1 - \lambda^2 (1 - \rho^2))\}.$$

This D matches the relative entropy $D(p^* \| p)$ between
bivariate normal densities, where $p(z, \tilde{z})$ is the joint density of
$Z, \tilde{Z}$ of correlation $\rho$ and where $p^*(z, \tilde{z})$ is the joint normal
obtained by tilting that density by $e^{(\lambda/2)(z^2 - \tilde{z}^2)}$, with $\lambda$ chosen to
make $(1/2)(Z^2 - \tilde{Z}^2)$ have mean $\Delta$, when there is such a $\lambda$.

We now give $D(\Delta, 1 - \rho^2)$ explicitly as an increasing function
of the ratio $\Delta^2/(1 - \rho^2)$. Working with logarithm base e, the
derivative with respect to $\lambda$ of the expression being maximized
yields a quadratic equation which can be solved for the optimal

$$\lambda^* = \frac{1}{2\Delta} \left( \sqrt{1 + 4\Delta^2/(1 - \rho^2)} - 1 \right).$$

Let $q = 4\Delta^2/(1 - \rho^2)$ and $\gamma = \sqrt{1 + q} - 1$, which is near
q/2 when q is small and approximately $\sqrt{q}$ when q is large.
Plug the optimized $\lambda$ into the above expression and simplify
to obtain $D = (1/2)(\gamma - \log(1 + \gamma/2))$, which is at least
$\gamma/4$. Thus D is the composition of strictly increasing
non-negative functions $(1/2)(\gamma - \log(1 + \gamma/2))$ and $\gamma = \sqrt{1 + q} - 1$
evaluated at $q = 4\Delta^2/(1 - \rho^2)$. For small values of this ratio,
we see that D is near $q/8 = (1/2)\Delta^2/(1 - \rho^2)$.
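
As a numerical sanity check of the closed form, the following Python sketch compares $(1/2)(\gamma - \log(1 + \gamma/2))$ with a direct grid maximization of $\lambda\Delta + (1/2)\log(1 - \lambda^2(1 - \rho^2))$ over the admissible range of $\lambda$. The function names, the grid size and the test point $\Delta = 0.3$, $1 - \rho^2 = 0.5$ are illustrative assumptions.

```python
from math import log, sqrt


def D_closed_form(delta, one_minus_rho2):
    """D = (1/2) * (gamma - log(1 + gamma/2)) with gamma = sqrt(1 + q) - 1."""
    q = 4 * delta ** 2 / one_minus_rho2
    gamma = sqrt(1 + q) - 1
    return 0.5 * (gamma - log(1 + gamma / 2))


def D_grid(delta, one_minus_rho2, steps=20000):
    """Grid maximization of the exponent over 0 <= lambda < 1/sqrt(1 - rho^2)."""
    lam_max = 1 / sqrt(one_minus_rho2)
    best = 0.0                       # value at lambda = 0
    for k in range(1, steps):
        lam = lam_max * k / steps
        value = lam * delta + 0.5 * log(1 - lam ** 2 * one_minus_rho2)
        best = max(best, value)
    return best


delta, s = 0.3, 0.5
d_exact = D_closed_form(delta, s)
d_numeric = D_grid(delta, s)
gamma = sqrt(1 + 4 * delta ** 2 / s) - 1
print(d_exact, d_numeric, gamma / 4)
```

The two values agree to well within the grid resolution, and the closed form is no smaller than $\gamma/4$ as stated.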

The expression corresponding to $D$ but with the maximum restricted to $0 \le \lambda \le 1$ is denoted $D_1 = D_1(\Delta, 1-\rho^2)$, that is,

$$D_1 = \max_{0\le\lambda\le 1}\{\lambda\Delta + (1/2)\log(1-\lambda^2(1-\rho^2))\}.$$

The corresponding optimal value of $\lambda$ is $\min\{1,\lambda^*\}$. When the optimal $\lambda$ is less than 1, the value of $D_1$ matches $D$ as given above.

The $\lambda=1$ case occurs when $1+4\Delta^2/(1-\rho^2) \ge (1+2\Delta)^2$, or equivalently $\Delta \ge (1-\rho^2)/\rho^2$. Then the exponent is $D_1 = \Delta + (1/2)\log\rho^2$, which is at least $\Delta - (1/2)\log(1+\Delta)$. Consequently, in this regime $D_1$ is between $\Delta/2$ and $\Delta$.

The special case $\rho^2 = 1$ is included with $D_1 = \Delta$.
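The two regimes of $D_1$ can be collected into one small routine. This is a sketch under assumptions: the name `D1` is hypothetical, `s` denotes $1-\rho^2$ in $[0,1)$, logs are natural, and returning 0 for non-positive `delta` (the maximum is then attained at $\lambda=0$) is a convention added here, since the text defines the exponent only for positive $\Delta$.

```python
import math

def D1(delta, s):
    """Exponent with lambda restricted to [0, 1]; s = 1 - rho^2, natural log."""
    if delta <= 0.0:
        return 0.0                      # assumed convention: maximum at lambda = 0
    rho2 = 1.0 - s
    if s == 0.0 or delta >= s / rho2:   # lambda = 1 regime (includes rho^2 = 1)
        return delta + 0.5 * math.log(rho2)
    q = 4.0 * delta * delta / s         # lambda* < 1: D1 coincides with D
    gamma = math.sqrt(1.0 + q) - 1.0
    return 0.5 * (gamma - math.log(1.0 + gamma / 2.0))
```

At the boundary $\Delta = (1-\rho^2)/\rho^2$ the two branches agree, so the function is continuous in `delta`.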



III. Performance of Least Squares 

As we have said, least squares provides optimal decoding 
of superposition codes. In this section we examine the per- 
formance of this least squares choice in terms of rate and 
reliability. We focus on partitioned superposition codes in 
which the codewords are superpositions with one term from 
each section. 

Let $S$ be an allowed subset of terms. We examine first subset coding in which to each such $S$ there is a corresponding coefficient vector $\beta$ in which the non-zero coefficients take a specified positive value as discussed above. We may denote the corresponding codeword $X_S = X\beta$. Among such codewords, least squares provides a choice for which $|Y - X_S|^2$ is minimal.

For a subset $S$ of size $L$ we measure how different it is from $S^*$, the subset that was sent. Let $\ell = \mathrm{card}(S - S^*)$ be the number of entries of $S$ not in $S^*$. Equivalently, since $S$ and $S^*$ are of the same size, it is the number of entries of $S^*$ not in $S$.

Let $\hat S$ be the least squares solution, or an approximate least squares solution, achieving $|Y - X_{\hat S}|^2 \le |Y - X_{S^*}|^2 + \delta_0$ with $\delta_0 \ge 0$. We call $\mathrm{card}(\hat S - S^*)$ the number of mistakes. Indeed, for a partitioned superposition code it is the number of sections incorrectly decoded.

There is a role for the function $C_\alpha = \frac{1}{2}\log(1+\alpha v)$ for $0 \le \alpha \le 1$, where $v = P/\sigma^2$ is the signal-to-noise ratio and $C_1 = C = (1/2)\log(1+v)$ is the channel capacity. We note that $C_\alpha - \alpha C$ is a non-negative concave function equal to 0 when $\alpha$ is 0 or 1 and strictly positive in between. The quantity $C_\alpha - \alpha R$ is larger by the additional amount $\alpha(C-R)$, positive when the rate $R$ is less than the Shannon capacity $C$.

The function $\psi_\alpha(\lambda) = -(1/2)\log[1-\lambda^2\alpha v/(1+\alpha v)]$ with $0 \le \lambda \le 1$ is the cumulant generating function of a test statistic in our analysis.

Our first result on the distribution of the number of mistakes 
is the following. 

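Lemma 1: Set $\alpha = \ell/L$ for an $\ell \in \{1, 2, \ldots, L\}$. For approximate least squares with $0 \le \delta_0 \le 2\sigma^2(C_\alpha - \alpha R)/\log e$, the probability of a fraction $\alpha = \ell/L$ mistakes is upper bounded by

$$\binom{L}{\alpha L}\exp\Big\{-n\max_{0\le\lambda\le 1}\{\lambda\Delta_\alpha - \psi_\alpha(\lambda)\}\Big\}$$

or equivalently,

$$\binom{L}{\alpha L}\exp\{-nD_1(\Delta_\alpha, \alpha v/(1+\alpha v))\},$$

where $\Delta_\alpha = C_\alpha - \alpha R - (\delta_0/2\sigma^2)\log e$ and $v$ is the signal-to-noise ratio.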
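To make the Lemma 1 bound concrete, the sketch below evaluates its logarithm for a partitioned code. It rests on assumptions that are not from the text: the function names and the parameter values in the usage note are illustrative, natural logarithms are used throughout (so the rate is in nats), the codelength is obtained from $nR = L\log B$, and `t` stands for $\delta_0/(2\sigma^2)$.

```python
import math

def D1(delta, s):
    """Exponent with lambda restricted to [0, 1]; s = 1 - rho^2, natural log."""
    if delta <= 0.0:
        return 0.0                      # assumed convention outside the lemma's range
    if s == 0.0 or delta >= s / (1.0 - s):
        return delta + 0.5 * math.log(1.0 - s)
    g = math.sqrt(1.0 + 4.0 * delta * delta / s) - 1.0
    return 0.5 * (g - math.log(1.0 + g / 2.0))

def lemma1_log_bound(L, B, v, R, ell, t=0.0):
    """Natural log of binom(L, ell) * exp(-n * D1(Delta_alpha, alpha v / (1 + alpha v)))."""
    alpha = ell / L
    n = L * math.log(B) / R                       # codelength from n R = L log B
    delta = 0.5 * math.log(1.0 + alpha * v) - alpha * R - t
    s = alpha * v / (1.0 + alpha * v)
    log_binom = math.lgamma(L + 1) - math.lgamma(ell + 1) - math.lgamma(L - ell + 1)
    return log_binom - n * D1(delta, s)
```

For instance, with the hypothetical values `L = 100`, `B = 4096`, `v = 15`, `R` at 70% of capacity and all sections wrong (`ell = 100`), the log bound is roughly -73.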

Remark 1: We find this Lemma 1 to be especially useful for $\alpha$ in the lower range of the interval from 0 to 1. Lemma 2 below will refine the analysis to provide an exponent more useful in the upper range of the interval.

Proof of Lemma 1: To incur $\ell$ mistakes, there must be an allowed subset $S$ of size $L$ which differs from the subset $S^*$ sent in an amount $\mathrm{card}(S - S^*) = \mathrm{card}(S^* - S) = \ell$ which undesirably has squared distance $|Y - X_S|^2$ less than or equal to the value $|Y - X_{S^*}|^2 + \delta_0$ achieved by $S^*$.

The analysis proceeds by considering an arbitrary such $S$, bounding the probability that $|Y - X_S|^2 \le |Y - X_{S^*}|^2 + \delta_0$, and then using an appropriately designed union bound to put such probabilities together.

Consider the statistic $T = T(S)$ given by

$$T(S) = \frac{1}{2}\left[\frac{|Y - X_S|^2}{\sigma^2} - \frac{|Y - X_{S^*}|^2}{\sigma^2}\right].$$

We set a threshold for this statistic equal to $t = \delta_0/(2\sigma^2)$. The event of interest is that $T \le t$.

The subsets $S$ and $S^*$ have an intersection $S_1 = S \cap S^*$ of size $L - \ell$ and difference $S_2 = S - S_1$ of size $\ell = \alpha L$. Given $(X_j : j \in S)$ the actual density of $Y$ is normal with mean $X_{S_1} = \sum_{j\in S_1} X_j$ and variance $(\sigma^2 + \alpha P)I$, and we denote this density $p(Y|X_{S_1})$. In particular, there is conditional independence of $Y$ and $X_{S_2}$ given $X_{S_1}$.

Consider the alternative hypothesis of a conditional distribution for $Y$ given $X_{S_1}$ and $X_{S_2}$ which is Normal$(X_S, \sigma^2 I)$. It is the distribution which would have governed $Y$ if $S$ were sent. Let $p_h(Y|X_{S_1},X_{S_2}) = p_h(Y|X_S)$ be the associated conditional density. With respect to this alternative hypothesis, the conditional distribution for $Y$ given $X_{S_1}$ remains Normal$(X_{S_1}, (\sigma^2 + \alpha P)I)$. That is, $p_h(Y|X_{S_1}) = p(Y|X_{S_1})$.

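We decompose the above test statistic as

$$T(S) = \frac{1}{2}\left[\frac{|Y - X_{S_1}|^2}{\sigma^2 + \alpha P} - \frac{|Y - X_{S^*}|^2}{\sigma^2}\right] + \frac{1}{2}\left[\frac{|Y - X_S|^2}{\sigma^2} - \frac{|Y - X_{S_1}|^2}{\sigma^2 + \alpha P}\right].$$

Let's call the two parts of this decomposition $T_1$ and $T_2$, respectively. Note that $T_1 = T_1(S_1)$ depends only on terms in $S^*$, whereas $T_2 = T_2(S)$ depends also on the part of $S$ not in $S^*$.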
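The decomposition $T = T_1 + T_2$ can be illustrated on simulated data. The sketch below is only a toy under stated assumptions: the function names, sizes, and seed are hypothetical, dictionary columns are drawn i.i.d. $N(0, P/L)$ as in the random design used in this paper, and $|x|^2$ is taken to be the normalized squared norm $(1/n)\sum_i x_i^2$.

```python
import random

def sq_norm(x):
    """Normalized squared norm |x|^2 = (1/n) sum x_i^2 (assumed convention)."""
    return sum(xi * xi for xi in x) / len(x)

def simulate_T(n=50, L=8, ell=3, P=15.0, sigma2=1.0, seed=0):
    """Return (T, T1, T2) for one competitor S that is wrong in the first ell sections."""
    rng = random.Random(seed)
    col = lambda: [rng.gauss(0.0, (P / L) ** 0.5) for _ in range(n)]
    sent = [col() for _ in range(L)]        # the terms of S*, one per section
    wrong = [col() for _ in range(ell)]     # the terms of S2 = S - S1
    add = lambda cols: [sum(c[i] for c in cols) for i in range(n)]
    X_star = add(sent)
    X_S1 = add(sent[ell:])                  # S1 = S intersect S*
    X_S = add(wrong + sent[ell:])
    Y = [X_star[i] + rng.gauss(0.0, sigma2 ** 0.5) for i in range(n)]
    diff = lambda a, b: [a[i] - b[i] for i in range(n)]
    alpha = ell / L
    r_S = sq_norm(diff(Y, X_S)) / sigma2
    r_star = sq_norm(diff(Y, X_star)) / sigma2
    r_S1 = sq_norm(diff(Y, X_S1)) / (sigma2 + alpha * P)
    return 0.5 * (r_S - r_star), 0.5 * (r_S1 - r_star), 0.5 * (r_S - r_S1)
```

The identity holds for every realization; only $T_2$ involves the competing terms, which is what lets the proof treat the two parts separately.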

Concerning $T_2$, note that we may express it as

$$T_2(S) = \frac{1}{n}\log\frac{p(Y|X_{S_1})}{p_h(Y|X_S)} + C_\alpha,$$

where

$$C_\alpha = \frac{1}{2}\log\frac{\sigma^2 + \alpha P}{\sigma^2}$$

is the adjustment by the logarithm of the ratio of the normalizing constants of these densities.

Thus $T_2$ is equivalent to a likelihood ratio test statistic between the actual conditional density and the constructed alternative hypothesis for the conditional density of $Y$ given $X_{S_1}$ and $X_{S_2}$. It is helpful to use Bayes rule to provide $p_h(X_{S_2}|Y,X_{S_1})$ via the equality of

$$\frac{p_h(X_{S_2}|Y,X_{S_1})}{p(X_{S_2}|X_{S_1})} \quad\text{and}\quad \frac{p_h(Y|X_{S_1},X_{S_2})}{p(Y|X_{S_1})}$$

and to interpret this equality as providing an alternative representation of the likelihood ratio in terms of the reverse conditionals for $X_{S_2}$ given $X_{S_1}$ and $Y$.

We are examining the event $E_\ell$ that there is an allowed subset $S = S_1 \cup S_2$ (with $S_1 = S \cap S^*$ of size $L - \ell$ and $S_2 = S - S_1$ of size $\ell$) such that $T(S)$ is less than $t$. For positive $\lambda$ the indicator of this event satisfies

$$1_{E_\ell} \le \sum_{S_1}\Big(\sum_{S_2} e^{-n(T(S)-t)}\Big)^{\lambda},$$

because, if there is such an $S$ with $T(S) - t$ negative, then indeed that contributes a term on the right side of value at least 1. Here the outer sum is over $S_1 \subset S^*$ of size $L - \ell$. For each such $S_1$, for the inner sum, we have $\ell$ sections in each of which, to comprise $S_2$, there is a term selected from among $B - 1$ choices other than the one prescribed by $S^*$.

To bound the probability of $E_\ell$, take the expectation of both sides, bring the expectation on the right inside the outer sum, and write it as the iterated expectation, where on the inside condition on $Y$, $X_{S_1}$ and $X_{S^*}$ to pull out the factor involving $T_1$, to obtain that $\mathbb{P}[E_\ell]$ is not more than

$$\sum_{S_1}\mathbb{E}\, e^{-n\lambda(T_1(S_1)-t)}\;\mathbb{E}_{X_{S_2}|Y,X_{S_1},X_{S^*}}\Big(\sum_{S_2} e^{-nT_2(S)}\Big)^{\lambda}.$$

A simplification here is that the true density for $X_{S_2}$ is independent of the conditioning variables $Y$, $X_{S_1}$ and $X_{S^*}$.

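We arrange for $\lambda$ to be not more than 1. Then by Jensen's inequality, the conditional expectation may be brought inside the $\lambda$ power and inside the inner sum, yielding

$$\sum_{S_1}\mathbb{E}\, e^{-n\lambda(T_1(S_1)-t)}\Big(\sum_{S_2}\mathbb{E}_{X_{S_2}|Y,X_{S_1},X_{S^*}}\, e^{-nT_2(S)}\Big)^{\lambda}.$$

Recall that

$$e^{-nT_2(S)} = \frac{p_h(X_{S_2}|Y,X_{S_1})}{p(X_{S_2})}\, e^{-nC_\alpha}$$

and that the true density for $X_{S_2}$ is independent of the conditioning variables in accordance with the $p(X_{S_2})$ in the denominator. So when we take the expectation of this ratio we cancel the denominator, leaving the numerator density which integrates to 1. Consequently, the resulting expectation of $e^{-nT_2(S)}$ is not more than $e^{-nC_\alpha}$. The sum over $S_2$ entails less than $B^\ell = e^{nR\ell/L}$ choices so the bound is

$$\sum_{S_1}\mathbb{E}\, e^{-n\lambda T_1(S_1)}\, e^{-n\lambda[C_\alpha - \alpha R - t]}.$$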

Now $nT_1(S_1)$ is a sum of $n$ independent mean-zero random variables each of which is the difference of squares of normals for which the squared correlation is $\rho_\alpha^2 = 1/(1+\alpha v)$. So the expectation $\mathbb{E}\, e^{-n\lambda T_1(S_1)}$ is found to be equal to $[1/[1-\lambda^2\alpha v/(1+\alpha v)]]^{n/2}$. When plugged in above it yields the claimed bound optimized over $\lambda$ in $[0,1]$. We recognize that the exponent takes the form $D_1(\Delta, 1-\rho^2)$ with $1-\rho^2 = \alpha v/(1+\alpha v)$ as discussed in the preliminaries. This completes the proof of Lemma 1.

Some additional remarks: The exponent $D_1$ in Lemma 1 (and its refinement in Lemma 2 to follow) depends on the fraction of mistakes $\alpha$ and the signal-to-noise ratio $v$ only through $\Delta_\alpha = C_\alpha - \alpha R - t$ and $1-\rho_\alpha^2$. As we have seen, the $\lambda < 1$ case occurs when $\Delta_\alpha < (1-\rho_\alpha^2)/\rho_\alpha^2$ and then $D$ is near $(1/2)\Delta_\alpha^2/(1-\rho_\alpha^2)$ when it is small; whereas, the $\lambda = 1$ case occurs when $\Delta_\alpha \ge (1-\rho_\alpha^2)/\rho_\alpha^2$ and then the exponent is at least $\Delta_\alpha - (1/2)\log(1+\Delta_\alpha) \ge \Delta_\alpha/2$.

This behavior of the exponent is similar to the usual order $(C-R)^2$ for $R$ close to $C$ and order $C-R$ for $R$ farther from $C$ associated with the theory in Gallager [29].

A difficulty with the Lemma 1 bound is that for $\alpha$ near 1 and for $R$ correspondingly close to $C$, in the key quantity $\Delta_\alpha^2/(1-\rho_\alpha^2)$, the order of $\Delta_\alpha^2$ is $(1-\alpha)^2$, which is too close to zero to cancel the effect of the combinatorial coefficient.

The following lemma refines the analysis of Lemma 1, obtaining the same exponent with an improved correlation coefficient. The denominator $1-\rho_\alpha^2 = \alpha(1-\alpha)v/(1+\alpha v)$ is improved by the presence of the factor $(1-\alpha)$, allowing the conclusion to be useful also for $\alpha$ near 1. The price we pay is the presence of an additional term in the bound.

For the statement of Lemma 2 we again use the test statistic $T(S)$ as defined in the proof of Lemma 1. For interpretation of what follows with arbitrary base of logarithm, in that definition of $T(S)$ multiply by $\log e$ and likewise take the threshold to be $t = \frac{\delta_0}{2\sigma^2}\log e$.



Lemma 2: Let a positive integer $\ell \le L$ be given and let $\alpha = \ell/L$. Suppose $0 \le t < C_\alpha - \alpha R$. As above let $E_\ell$ be the event that there is an allowed $L$ term subset $S$ with $S - S^*$ of size $\ell$ such that $T(S)$ is less than $t$. Then $\mathbb{P}[E_\ell]$ is bounded by the minimum for $t_\alpha$ in the interval between $t$ and $C_\alpha - \alpha R$ of the following

$$\binom{L}{L\alpha}\exp\{-nD_1(C_\alpha - \alpha R - t_\alpha,\, 1-\rho_\alpha^2)\} + \exp\{-nD(t_\alpha - t,\, \alpha^2 v/(1+\alpha^2 v))\},$$

where $1-\rho_\alpha^2 = \alpha(1-\alpha)v/(1+\alpha v)$.

Fig. 2. Exponents of contributions to the error probability as functions of $\alpha = \ell/L$ using exact least squares, i.e., $t = 0$, with $L = 100$, $B = 2^{[\text{illegible}]}$, signal-to-noise ratio $v = 15$, and rate 70% of capacity. The red and blue curves are the $-\log \mathbb{P}[\tilde E_\ell]$ and $-\log \mathbb{P}[E_\ell^*]$ bounds, using the natural logarithm, from the two terms in Lemma 2 with optimized $t_\alpha$. The dotted green curve is $d_{n,\alpha}$ explained below. With $\alpha_0 = 0.1$, the total probability of at least that fraction of mistakes is bounded by $1.8(10)^{-[\text{illegible}]}$.
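The Lemma 2 bound involves a free parameter $t_\alpha$, which can be optimized by a simple grid search. The sketch below is an illustration under assumptions not taken from the text: the function names, the grid size, and the parameter values in the usage note are hypothetical, natural logarithms are used, $n$ is obtained from $nR = L\log B$, and a trivial bound of 1 is returned when the hypothesis $t < C_\alpha - \alpha R$ fails.

```python
import math

def D(delta, s):
    """Unrestricted exponent D(Delta, s), s = 1 - rho^2 > 0, natural log."""
    g = math.sqrt(1.0 + 4.0 * delta * delta / s) - 1.0
    return 0.5 * (g - math.log(1.0 + g / 2.0))

def D1(delta, s):
    """Exponent with lambda restricted to [0, 1]; s = 0 corresponds to rho^2 = 1."""
    if delta <= 0.0:
        return 0.0
    if s == 0.0 or delta >= s / (1.0 - s):
        return delta + 0.5 * math.log(1.0 - s)
    return D(delta, s)

def lemma2_bound(L, B, v, R, ell, t=0.0, grid=400):
    """Minimum over a grid of t_alpha of the two-term bound in Lemma 2."""
    alpha = ell / L
    n = L * math.log(B) / R
    hi = 0.5 * math.log(1.0 + alpha * v) - alpha * R       # C_alpha - alpha R
    if hi <= t:
        return 1.0                                          # assumed trivial fallback
    log_binom = math.lgamma(L + 1) - math.lgamma(ell + 1) - math.lgamma(L - ell + 1)
    s1 = alpha * (1.0 - alpha) * v / (1.0 + alpha * v)
    s2 = alpha ** 2 * v / (1.0 + alpha ** 2 * v)
    best = math.inf
    for k in range(1, grid):
        t_alpha = t + (hi - t) * k / grid
        first = math.exp(log_binom - n * D1(hi - t_alpha, s1))
        second = math.exp(-n * D(t_alpha - t, s2))
        best = min(best, first + second)
    return best
```

With the hypothetical values `L = 100`, `B = 4096`, `v = 15`, `R = 0.7 C` and `ell = 100`, the minimized bound is far below 1.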

Proof of Lemma 2: Split the test statistic $T(S) = \tilde T(S) + T^*$ where

$$\tilde T(S) = \frac{1}{2}\left[\frac{|Y - X_S|^2}{\sigma^2} - \frac{|Y - (1-\alpha)X_{S^*}|^2}{\sigma^2 + \alpha^2 P}\right]$$

and

$$T^* = \frac{1}{2}\left[\frac{|Y - (1-\alpha)X_{S^*}|^2}{\sigma^2 + \alpha^2 P} - \frac{|Y - X_{S^*}|^2}{\sigma^2}\right].$$

Likewise we split the threshold $t = \tilde t + t^*$ where $t^* = -(t_\alpha - t)$ is negative and $\tilde t = t_\alpha$ is positive.

The event that there is an $S$ with $T(S) < t$ is contained in the union of the two events $\tilde E_\ell$, that there is an $S$ with $\tilde T(S) < \tilde t$, and the event $E_\ell^*$, that $T^* < t^*$. The part $T^*$ has no dependence on $S$ so it can be treated more simply. It is a mean zero average of differences of squared normal random variables, with squared correlation $1/(1+\alpha^2 v)$. So using its moment generating function, $\mathbb{P}[E_\ell^*]$ is exponentially small, bounded by the second of the two expressions above.

Concerning $\mathbb{P}[\tilde E_\ell]$, its analysis is much the same as for Lemma 1. We again decompose $\tilde T(S)$ as the sum $\tilde T_1(S_1) + \tilde T_2(S)$, where $\tilde T_2(S) = T_2(S)$ is the same as before. The difference is that in forming $\tilde T_1(S_1)$ we subtract $|Y - (1-\alpha)X_{S^*}|^2/(\sigma^2 + \alpha^2 P)$ rather than $|Y - X_{S^*}|^2/\sigma^2$. Consequently,

$$\tilde T_1(S_1) = \frac{1}{2}\left[\frac{|Y - X_{S_1}|^2}{\sigma^2 + \alpha P} - \frac{|Y - (1-\alpha)X_{S^*}|^2}{\sigma^2 + \alpha^2 P}\right],$$

which again involves a difference of squares of standardized normals. But here the coefficient $(1-\alpha)$ multiplying $X_{S^*}$ is such that we have maximized the correlations between the $Y - X_{S_1}$ and $Y - (1-\alpha)X_{S^*}$. Consequently, we have reduced the spread of the distribution of the differences of squares of their standardizations as quantified by the cumulant generating function. One finds that the squared correlation coefficient is $\rho_\alpha^2 = (1+\alpha^2 v)/(1+\alpha v)$ for which $1-\rho_\alpha^2 = \alpha(1-\alpha)v/(1+\alpha v)$. Accordingly we have that the moment generating function is $\mathbb{E}\, e^{-n\lambda \tilde T_1(S_1)} = \exp\{-(n/2)\log[1-\lambda^2(1-\rho_\alpha^2)]\}$, which gives rise to the bound appearing as the first of the two expressions above. This completes the proof of Lemma 2.
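The squared correlation used in this proof follows from a short variance computation, which the sketch below carries out in exact rational arithmetic. It is an illustration only: the function name is hypothetical, and it normalizes $\sigma^2 = 1$ so that $P = v$, with the sent terms split into those in $S_1$ (power $(1-\alpha)v$) and those outside $S_1$ (power $\alpha v$).

```python
from fractions import Fraction

def squared_correlation(alpha, v):
    """rho^2 between Y - X_{S1} and Y - (1 - alpha) X_{S*}, with sigma^2 = 1 and P = v."""
    var_missed = alpha * v            # power of the sent terms outside S1
    var_kept = (1 - alpha) * v        # power of the sent terms inside S1
    # Y - X_{S1}           = X_missed + noise
    # Y - (1-alpha) X_{S*} = alpha * X_kept + alpha * X_missed + noise
    cov = alpha * var_missed + 1
    var_a = var_missed + 1
    var_b = alpha ** 2 * var_kept + alpha ** 2 * var_missed + 1
    return cov ** 2 / (var_a * var_b)
```

Passing `Fraction` arguments, for example `squared_correlation(Fraction(3, 10), Fraction(15))`, reproduces $(1+\alpha^2 v)/(1+\alpha v)$ exactly.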

The method of analysis also allows consideration of subset coding without partitioning. For, in this case all $\binom{N}{L}$ subsets of size $L$ correspond to codewords, so with the rate in nats we have $e^{nR} = \binom{N}{L}$. The analysis proceeds in the same manner, with the same number $\binom{L}{L-\ell}$ of choices of sets $S_1 = S \cap S^*$ where $S$ and $S^*$ agree on $L - \ell$ terms, but now with $\binom{N-L}{\ell}$ choices of sets $S_2 = S - S^*$ of size $\ell$ where they disagree. We obtain the same bounds as above except that where we have $B^\ell = e^{n\alpha R}$ with the exponent $\alpha R$ it is replaced by $\binom{N-L}{\ell} = e^{nR(\alpha)}$ with the exponent $R(\alpha)$ defined by $R(\alpha) = R\log\binom{N-L}{L\alpha}/\log\binom{N}{L}$. Thus we have the following conclusion.

Corollary 3: For subset superposition coding, the probability of the event $E_\ell$ that there is a $\beta$ that is incorrect in $\ell$ sections and has $|Y - X\beta|^2 \le |Y - X\beta^*|^2 + \delta_0$ is bounded by the minimum of the same expressions given in Lemma 1 and Lemma 2, except that the term $\alpha R$ appearing in these expressions is replaced by the quantity $R(\alpha)$ defined above.



IV. Sufficient Section Size 

We come to the matter of sufficient conditions on the section 
size B for our exponential bounds to swamp the combinatorial 
coefficient, for partitioned superposition codes. 

We call $a = (\log B)/(\log L)$ the section size rate, that is, the bits required to describe the member of a section relative to the bits required to describe which section. It is invariant to the base of the log. Equivalently we have $B$ and $L$ related by $B = L^a$. Note that the size of $a$ controls the polynomial size of the dictionary $N = BL = L^{a+1}$.

In both cases the codelength may be written as

$$n = \frac{aL\log L}{R}.$$

We do not want a requirement on the section sizes with $a$ of order $1/(C-R)$ for then the complexity would grow exponentially with this inverse of the gap from capacity. So instead let's decompose $\Delta_\alpha = \tilde\Delta_\alpha + \alpha(C-R) - t_\alpha$ where $\tilde\Delta_\alpha = C_\alpha - \alpha C$. We investigate in this section the use of $\tilde\Delta_\alpha$ to swamp the combinatorial coefficient. In the next section excess in $\tilde\Delta_\alpha$, beyond that needed to cancel the combinatorial coefficient, plus $\alpha(C-R) - t_\alpha$ are used to produce exponentially small error probability.

Define $D_{\alpha,v} = D_1(\Delta_\alpha, 1-\rho_\alpha^2)$ and $\tilde D_{\alpha,v} = D_1(\tilde\Delta_\alpha, 1-\rho_\alpha^2)$. Now $D_1(\Delta, 1-\rho^2)$ is increasing as a function of $\Delta$, so $D_{\alpha,v}$ is greater than $\tilde D_{\alpha,v}$ whenever $\Delta_\alpha > \tilde\Delta_\alpha$. Accordingly, we decompose the exponent $D_{\alpha,v}$ as the sum of two components, namely, $\tilde D_{\alpha,v}$ and the difference $D_{\alpha,v} - \tilde D_{\alpha,v}$.

We then ask whether the first part of the exponent, denoted $\tilde D_{\alpha,v}$, is sufficient to wash out the effect of the log combinatorial coefficient $\log\binom{L}{L\alpha}$. That is, we want to arrange for the nonnegativity of the difference

$$d_{n,\alpha} = n\tilde D_{\alpha,v} - \log\binom{L}{L\alpha}.$$


This difference is small for $\alpha$ near 0 and 1. Furthermore, its constituent quantities have a shape comparable to multiples of $\alpha(1-\alpha)$. Consider first $\tilde\Delta_\alpha = C_\alpha - \alpha C$ and take the log to be base $e$. It has second derivative $-(1/2)v^2/(1+\alpha v)^2$. It follows that $\tilde\Delta_\alpha \ge (1/4)\alpha(1-\alpha)v^2/(1+v)^2$, since the difference of the two sides has negative second derivative, so it is concave and equals 0 at $\alpha = 0$ and $\alpha = 1$. Likewise $(1-\rho_\alpha^2) = \alpha(1-\alpha)v/(1+\alpha v)$ so the ratio $u = 4\tilde\Delta_\alpha^2/(1-\rho_\alpha^2)$ is at least $(1/4)\alpha(1-\alpha)v^3(1+\alpha v)/(1+v)^4$. Consequently, whether the optimal $\lambda$ is equal to 1 or is less than 1, we find that $\tilde D_{\alpha,v}$ is of order $\alpha(1-\alpha)$.

Similarly, there is the matter of $\log\binom{L}{L\alpha}$, with $L\alpha$ restricted to have integer values. It enjoys the upper bounds $\min(\alpha, 1-\alpha)L\log L$ and $L\log 2$ so that it is not more than $\alpha(1-\alpha)(L\log L)/(1-\delta_L)$ where $\delta_L = (\log 2)/\log L$.
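The chain of upper bounds on the log combinatorial coefficient is easy to confirm numerically. The following sketch uses hypothetical function names, natural logarithms, and `math.lgamma` for the exact log binomial; it is meant for $L \ge 4$, where $\delta_L \le 1/2$.

```python
import math

def log_binom(L, k):
    """Natural log of the binomial coefficient binom(L, k)."""
    return math.lgamma(L + 1) - math.lgamma(k + 1) - math.lgamma(L - k + 1)

def binom_bounds(L, k):
    """Return (exact, simple, smooth) for alpha = k/L.

    simple = min( min(alpha, 1-alpha) L log L , L log 2 )
    smooth = alpha (1 - alpha) L log L / (1 - delta_L), delta_L = log 2 / log L
    """
    alpha = k / L
    logL = math.log(L)
    delta_L = math.log(2.0) / logL
    simple = min(min(alpha, 1.0 - alpha) * L * logL, L * math.log(2.0))
    smooth = alpha * (1.0 - alpha) * L * logL / (1.0 - delta_L)
    return log_binom(L, k), simple, smooth
```

For `L = 64` the three quantities are ordered as claimed for every `k` from 1 to 63, with the first bound tight at `k = 1`.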



Consequently, using $n = (aL\log L)/R$, one finds that for sufficiently large $a$ depending on $v$, the difference $d_{n,\alpha}$ is nonnegative uniformly for the permitted $\alpha$ in $[0,1]$. The smallest such section size rate is

$$a_{v,L} = \max_{\alpha}\frac{R\log\binom{L}{L\alpha}}{\tilde D_{\alpha,v}\,L\log L},$$

where the maximum is for $\alpha$ in $\{1/L, 2/L, \ldots, 1-1/L\}$. This definition has the required invariance to the choice of base of the logarithm, assuming that the same base is used for the communication rate $R$ and for the $C_\alpha - \alpha C$ that arises in the definition of $\tilde D_{\alpha,v}$.
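The section size rate $a_{v,L}$ can be computed by direct maximization over the permitted fractions. The sketch below assumes natural logarithms for both $R$ and $C_\alpha - \alpha C$; the function names and the example values are hypothetical, and the exponent uses $1-\rho_\alpha^2 = \alpha(1-\alpha)v/(1+\alpha v)$ from Lemma 2.

```python
import math

def D1(delta, s):
    """Exponent with lambda restricted to [0, 1]; s = 1 - rho^2, natural log."""
    if delta <= 0.0:
        return 0.0
    if s == 0.0 or delta >= s / (1.0 - s):
        return delta + 0.5 * math.log(1.0 - s)
    g = math.sqrt(1.0 + 4.0 * delta * delta / s) - 1.0
    return 0.5 * (g - math.log(1.0 + g / 2.0))

def log_binom(L, k):
    return math.lgamma(L + 1) - math.lgamma(k + 1) - math.lgamma(L - k + 1)

def D_tilde(v, alpha):
    """D1 evaluated at C_alpha - alpha C with the Lemma 2 value of 1 - rho^2."""
    C = 0.5 * math.log(1.0 + v)
    delta = 0.5 * math.log(1.0 + alpha * v) - alpha * C
    return D1(delta, alpha * (1.0 - alpha) * v / (1.0 + alpha * v))

def a_vL(v, L, R):
    """Smallest section size rate making d_{n,alpha} nonnegative for alpha = k/L, 0 < k < L."""
    return max(R * log_binom(L, k) / (D_tilde(v, k / L) * L * math.log(L))
               for k in range(1, L))

def d_n_alpha(v, L, R, a, k):
    """The difference n * D_tilde - log binom(L, k) with n = a L log L / R."""
    n = a * L * math.log(L) / R
    return n * D_tilde(v, k / L) - log_binom(L, k)
```

By construction, taking `a = a_vL(v, L, R)` makes `d_n_alpha` nonnegative for every `k`, with equality at the maximizing fraction.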

In the above ratio the numerator and denominator are both 0 at $\alpha = 0$ and $\alpha = 1$ (yielding $d_{n,\alpha} = 0$ at the ends). Accordingly, we have excluded 0 and 1 from the definition of $a_{v,L}$ for finite $L$. Nevertheless, limiting ratios arise at these ends.

We show that the value of $a_{v,L}$ is fairly insensitive to the value of $L$, with the maximum over the whole range being close to a limit $a_v$ which is characterized by values in the vicinity of $\alpha = 1$.

Let $v^*$ near 15.8 be the solution to $(1+v^*)\log(1+v^*) = 3v^*\log e$.

Lemma 4: The section size rate $a_{v,L}$ has a continuous limit $a_v = \lim_{L\to\infty} a_{v,L}$ which is given, for $0 < v < v^*$, by

$$a_v = \frac{R}{[(1+v)\log(1+v) - v\log e]^2/[8v(1+v)\log e]}$$

and for $v \ge v^*$ by

$$a_v = \frac{R}{[(1+v)\log(1+v) - 2v\log e]/[2(1+v)]},$$

where $v$ is the signal-to-noise ratio. With $R$ replaced by $C = (1/2)\log(1+v)$ and using log base $e$, in the case $0 < v < v^*$, it is

$$\frac{4v(1+v)\log(1+v)}{[(1+v)\log(1+v) - v]^2},$$

which is approximately $16/v^2$ for small positive $v$; whereas, in the case $v \ge v^*$ it is

$$\frac{(1+v)\log(1+v)}{(1+v)\log(1+v) - 2v},$$

which asymptotes to the value 1 for large $v$.
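The limiting section size rate at capacity is simple to evaluate. The sketch below, with hypothetical function names, finds $v^*$ by bisection and then applies the two expressions of Lemma 4 with $R$ replaced by $C$ and natural logarithms.

```python
import math

def v_star(lo=1.0, hi=100.0, iters=200):
    """Positive root of (1+v) log(1+v) = 3v, by bisection on [lo, hi]."""
    f = lambda v: (1.0 + v) * math.log(1.0 + v) - 3.0 * v
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def a_v_small(v):
    """Lemma 4 expression for 0 < v < v*, with R = C."""
    w = (1.0 + v) * math.log(1.0 + v)
    return 4.0 * v * w / (w - v) ** 2

def a_v_large(v):
    """Lemma 4 expression for v >= v*, with R = C."""
    w = (1.0 + v) * math.log(1.0 + v)
    return w / (w - 2.0 * v)

def a_v_at_capacity(v):
    return a_v_small(v) if v < v_star() else a_v_large(v)
```

Both expressions take the value 3 at $v = v^*$, which is the continuity asserted in the lemma, and `a_v_at_capacity(7.0)` is close to 5.0.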

Proof of Lemma 4: For $\alpha$ in $(0,1)$ we use $\log\binom{L}{L\alpha} \le L\log 2$ and the strict positivity of $\tilde D_{\alpha,v}$ to see that the ratio in the definition of $a_{v,L}$ tends to zero uniformly within compact sets interior to $(0,1)$. So the limit $a_v$ is determined by the maximum of the limits of the ratios at the two ends. In the vicinity of the left and right ends we replace $\log\binom{L}{L\alpha}$ by the continuous upper bounds $\alpha L\log L$ and $(1-\alpha)L\log L$, respectively, which are tight at $\alpha = 1/L$ and $1-\alpha = 1/L$, respectively. Then in accordance with L'Hopital's rule, the limit of the ratios equals the ratios of the derivatives at $\alpha = 0$ and $\alpha = 1$, respectively. Accordingly,

$$a_v = \max\left\{\frac{R}{\tilde D'_{0,v}},\; \frac{-R}{\tilde D'_{1,v}}\right\},$$

where $\tilde D'_{0,v}$ and $\tilde D'_{1,v}$ are the derivatives of $\tilde D_{\alpha,v}$ with respect to $\alpha$ evaluated at $\alpha = 0$ and $\alpha = 1$, respectively.

To determine the behavior of $\tilde D_\alpha = \tilde D_{\alpha,v}$ in the vicinity of 0 and 1 we first need to determine whether the optimal $\lambda$ in its definition is strictly less than 1 or equal to 1. According to our earlier developments that is determined by whether $\tilde\Delta_\alpha < (1-\rho_\alpha^2)/\rho_\alpha^2$. The right side of this is $\alpha(1-\alpha)v/(1+\alpha^2 v)$. So it is equivalent to determine whether the ratio

$$\frac{(C_\alpha - \alpha C)(1+\alpha^2 v)}{\alpha(1-\alpha)v}$$

is less than 1 for $\alpha$ in the vicinity of 0 and 1. Using L'Hopital's rule it suffices to determine whether the ratio of derivatives is less than 1 when evaluated at 0 and 1. At $\alpha = 0$ it is $(1/2)[v - \log(1+v)]/v$, which is not more than 1/2 (certainly less than 1) for all positive $v$; whereas, at $\alpha = 1$ the ratio of derivatives is $(1/2)[(1+v)\log(1+v) - v]/v$, which is less than 1 if and only if $v < v^*$.

For the cases in which the optimal $\lambda < 1$, we need to determine the derivative of $\tilde D_\alpha$ at $\alpha = 0$ and $\alpha = 1$. Recall that $\tilde D_\alpha$ is the composition of the functions $(1/2)(\gamma - \log(1+\gamma/2))$ and $\gamma = \sqrt{1+u} - 1$ and $u_\alpha = 4\tilde\Delta_\alpha^2/(1-\rho_\alpha^2)$. We use the chain rule taking the products of the associated derivatives. The first of these functions has derivative $(1/2)(1 - 1/(2+\gamma))$ which is 1/4 at $\gamma = 0$, the second of these has derivative $1/(2\sqrt{1+u})$ which is 1/2 at $u = 0$, and the third of these functions is

$$u_\alpha = \frac{(\log(1+\alpha v) - \alpha\log(1+v))^2}{\alpha(1-\alpha)v/(1+\alpha v)},$$

which has derivative that evaluates to $(v - \log(1+v))^2/v$ at $\alpha = 0$ and evaluates to $-[(1+v)\log(1+v) - v]^2/[v(1+v)]$ at $\alpha = 1$. The first of these gives what is needed for the left end for all positive $v$ and the second what is needed for the right end for all $v < v^*$.

The magnitude of the derivative at 1 is smaller than at 0. Indeed, taking square roots this is the same as the claim that $(1+v)\log(1+v) - v < \sqrt{1+v}\,(v - \log(1+v))$. Replacing $s = \sqrt{1+v}$ and rearranging, it reduces to $s\log s < (s^2-1)/2$, which is true for $s > 1$ since the two sides match at $s = 1$ and have derivatives $1 + \log s < s$. Thus the limiting value for $\alpha$ near 1 is what matters for the maximum. This produces the claimed form of $a_v$ for $v < v^*$.

In contrast for $v > v^*$, the optimal $\lambda = 1$ for $\alpha$ in the vicinity of 1. In this case we use $\tilde D_\alpha = \tilde\Delta_\alpha + (1/2)\log\rho_\alpha^2$ which has derivative equal to $-(1/2)[(1+v)\log(1+v) - 2v]/(1+v)$ at $\alpha = 1$, which is again smaller in magnitude than the derivative at $\alpha = 0$, producing the claimed form for $a_v$ for $v > v^*$.

At $v = v^*$ we equate $(1+v)\log(1+v) = 3v$ and see that both of the expressions for the magnitude of the derivative at 1 agree with each other (both reducing to $v/(2(1+v))$) so the argument extends to this case, and the expression for $a_v$ is continuous in $v$. This completes the proof of Lemma 4.

While $a_v$ is undesirably large for small $v$, we have reasonable values for moderately large $v$. In particular, $a_v$ equals 5.0 and 3, respectively, at $v = 7$ and $v^* = 15.8$, and it is near 1 for large $v$.

Numerically it is of interest to ascertain the minimal section size rate $a_{v,L,\epsilon,\alpha_0}$, for a specified $L$ such as $L = 64$, for $R$ chosen to be a prescribed high fraction of $C$, say $R = 0.8C$, for $\alpha_0$ a prescribed small target fraction of mistakes, say $\alpha_0 = 0.1$, and for $\epsilon$ to be a small target probability, so as to obtain $\min\{\mathbb{P}[E_\ell],\, \mathbb{P}[\tilde E_\ell] + \mathbb{P}[E_\ell^*]\} \le \epsilon$, taking the minimum over allowed values of $t_\alpha$, for every $\alpha = \ell/L$ at least $\alpha_0$. For this calculation the bound from Lemma 1 is used for $\mathbb{P}[E_\ell]$ and the bound from Lemma 2 is used for $\mathbb{P}[\tilde E_\ell] + \mathbb{P}[E_\ell^*]$. This is illustrated in Figure 3, plotting the minimal section size rate as a function of $v$ for $\epsilon = e^{-10}$. With such $R$ moderately less than $C$ we observe substantial reduction in the required section size rate.

Fig. 3. Sufficient section size rate $a$ as a function of the signal-to-noise ratio $v$. The dashed curve shows $a_{v,L}$ at $L = 64$. Just below it the thin solid curve is the limit for large $L$. For section size $B \ge L^a$ the error probabilities are exponentially small for all $R < C$ and any $\alpha_0 > 0$. The bottom curve shows the minimal section size rate for the bound on the error probability contributions to be less than $e^{-10}$, with $R = 0.8C$ and $\alpha_0 = 0.1$ at $L = 64$.



Extra $\Delta_\alpha$ beyond the minimum: Via the above analysis we determine the minimum value of $\Delta$ for which the combinatorial term is canceled, and we characterize the amount beyond that minimum which makes the error probability exponentially small. Arrange $\Delta_\alpha^{\min}$ to be the solution to the equation

$$nD_1(\Delta_\alpha^{\min}, 1-\rho_\alpha^2) = \log\binom{L}{L\alpha}.$$

To see its characteristics, consider the candidate value $(1-\rho_\alpha^2)^{1/2}G(r_\alpha)$ at

$$r_\alpha = \frac{1}{n}\log\binom{L}{L\alpha}$$

using log base $e$. Here $G(r)$ is the inverse of the function $D(\delta, 1)$ which is the composition of the increasing functions $(1/2)[\gamma - \log(1+\gamma/2)]$ and $\gamma = \sqrt{1+4\delta^2} - 1$ previously discussed, beginning in Section II. This $G(r)$ is near $\sqrt{2r}$ for small $r$. When $G(r_\alpha) < (1-\rho_\alpha^2)^{1/2}/\rho_\alpha^2$ the condition $\lambda < 1$ is satisfied and $\Delta_\alpha^{\min}$ equal to this candidate value indeed solves the above equation; otherwise $\Delta_\alpha^{\min} = r_\alpha - (1/2)\log\rho_\alpha^2$ provides the solution.
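The recipe for $\Delta_\alpha^{\min}$ can be sketched numerically, with the inverse function $G$ obtained by bisection. The names below are hypothetical, logs are natural, `s` denotes $1-\rho_\alpha^2$, and `r` denotes $r_\alpha$; the helper `D1` is included only so that the solution can be verified.

```python
import math

def D_unit(delta):
    """D(delta, 1) = (1/2)(gamma - log(1 + gamma/2)) with gamma = sqrt(1 + 4 delta^2) - 1."""
    g = math.sqrt(1.0 + 4.0 * delta * delta) - 1.0
    return 0.5 * (g - math.log(1.0 + g / 2.0))

def G(r, iters=200):
    """Inverse of the increasing function D_unit, found by bisection."""
    lo, hi = 0.0, 1.0
    while D_unit(hi) < r:
        hi *= 2.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if D_unit(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def delta_min(r, s):
    """Solve D1(Delta, s) = r for Delta, with s = 1 - rho^2 in [0, 1)."""
    rho2 = 1.0 - s
    g = G(r)
    if s > 0.0 and g < math.sqrt(s) / rho2:   # lambda < 1 regime
        return math.sqrt(s) * g
    return r - 0.5 * math.log(rho2)           # lambda = 1 regime

def D1(delta, s):
    """Exponent with lambda restricted to [0, 1], used here only to check delta_min."""
    if s == 0.0 or delta >= s / (1.0 - s):
        return delta + 0.5 * math.log(1.0 - s)
    return D_unit(delta / math.sqrt(s))
```

Substituting `delta_min(r, s)` back into `D1` should return `r` in either regime, and `G(r)` approaches $\sqrt{2r}$ for small `r`.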

Now $r_\alpha = (R/a)\big(\log\binom{L}{L\alpha}\big)/(L\log L)$. With $\alpha L$ restricted to integers between 0 and $L$, it is not more than $(R/a)\alpha$ and $(R/a)(1-\alpha)$, with equality at particular $\alpha$ near 0 and 1, respectively. It remains small, with $r_\alpha \le (R/a)(\log 2)/\log L$, for $0 \le \alpha \le 1$. Also we have $1-\rho_\alpha^2 = \alpha(1-\alpha)v/(1+\alpha v)$ from Lemma 2. Consequently, $\Delta_\alpha^{\min}$ is small for large $L$; moreover, for $\alpha$ near 0 and 1, it is of order $\alpha$ and $1-\alpha$, respectively, and via the indicated bounds, derivatives at 0 and 1 can be explicitly determined.

The analysis in Lemma 4 may be interpreted as determining section size rates $a$ such that the differentiable upper bounds on $\Delta_\alpha^{\min}$ are less than or equal to $\tilde\Delta_\alpha = C_\alpha - \alpha C$ for $0 \le \alpha \le 1$, where, noting that these quantities are 0 at the endpoints of the interval, the critical section size rate is determined by matching the slopes at $\alpha = 1$. At the other end of the interval, the bound on the difference $\tilde\Delta_\alpha - \Delta_\alpha^{\min}$ has a strictly positive slope at $\alpha = 0$, given by $\tau_v = (1/2)[v - \log(1+v)] - [2vR/a]^{1/2}$.

Recall that $\Delta_\alpha = C_\alpha - \alpha R - t_\alpha$. For a sensible probability bound in Lemma 2, less than 1, we need to arrange $\Delta_\alpha$ greater than $\Delta_\alpha^{\min}$. This we can do if the threshold $t$ is less than $C_\alpha - \alpha R - \Delta_\alpha^{\min}$ and $t_\alpha$ is strictly between.

Express $\Delta_\alpha$ as the sum of $\Delta_\alpha^{\min}$, needed to cancel the combinatorial coefficient, and $\Delta_\alpha^{\mathrm{extra}} = C_\alpha - \alpha R - \Delta_\alpha^{\min} - t_\alpha$, which is positive. This $\Delta_\alpha^{\mathrm{extra}}$ arises in establishing that the main term in the probability bound is exponentially small. It decomposes as $\Delta_\alpha^{\mathrm{extra}} = \alpha(C-R) + (\tilde\Delta_\alpha - \Delta_\alpha^{\min}) - t_\alpha$, which reveals different regimes in the behavior of the exponent. For high $\alpha$ what matters is the $\alpha(C-R)$ term, positive with $R < C$, and that $t_\alpha$ stays less than the gap $\alpha(C-R)$. For small $\alpha$, we approximate $\Delta_\alpha^{\mathrm{extra}}$ by $\alpha[(C-R) + \tau_v] - t_\alpha$.

For moderate and small $\alpha$, having $R < C$ is not so important to the exponent, as the positivity of $\tilde\Delta_\alpha - \Delta_\alpha^{\min}$ produces a positive exponent even if $R$ matches or is slightly greater than $C$. In this regime, the Lemma 1 bound is preferred, where we set $\Delta_\alpha = C_\alpha - \alpha R - t$ without need for $t_\alpha$.

V. Confirming Exponentially Small Probability 

In this section we put the above conclusions together to demonstrate the reliability of approximate least squares. The probability of the event of more than any small positive fraction of mistakes $\alpha_0 = \ell_0/L$ is shown to be exponentially small.

Recall the setting that we have a random dictionary $X$ of $L$ sections, each of size $B$. The mapping from $K$-bit input strings $u$ to coefficient vectors $\beta(u)$ is as previously described. The set $\mathcal{B}$ of such vectors $\beta$ are those that have one non-zero coefficient in each section (with possible freedom for the choice of sign) and magnitude of the non-zero coefficient equal to 1. Let $\beta^* = \beta(u^*)$ be the coefficient vector for an arbitrary input $u^*$. We treat both the case of a fixed input, and the case that the input is drawn at random from the set of possible inputs. The codeword sent $X\beta^*$ is the superposition of a subset of terms with one from each section. The received string is $Y = X\beta^* + \varepsilon$ with $\varepsilon$ distributed normal $N(0, \sigma^2 I)$. The columns of $X$ are independent $N(0, (P/L)I)$ and $X$ and $Y$ are known to the receiver, but not $\beta^*$. The section size rate $a$ is such that $B = L^a$.

In fashion with Shannon theory, the expectations in the 
following theorem are taken with respect to the distribution 
of the design X as well as with respect to the distribution of 
the noise; implications for random individual dictionaries X 
are discussed after the proof. 

The estimator $\hat\beta$ is assumed to be an (approximate) least squares estimator, taking values in $\mathcal{B}$ and satisfying $|Y - X\hat\beta|^2 \le |Y - X\beta^*|^2 + \delta_0$, with $\delta_0 \ge 0$. Let mistakes denote the number of mistakes, that is, the number of sections in which the non-zero term in $\hat\beta$ is different from the term in $\beta^*$.

Suppose the threshold $t = \frac{\delta_0}{2\sigma^2}\log e$ is not more than $(1/2)\min_{\alpha\ge\alpha_0}\{\alpha(C-R) + (\tilde\Delta_\alpha - \Delta_\alpha^{\min})\}$. Some natural choices for the threshold include $t = 0$, $t = (1/2)\alpha_0(C-R)$, and $t = (1/2)\alpha_0\tau_v$. For positive $x$ let $g(x) = \min\{x, x^2\}$.

Theorem 5: Suppose the section size rate $a$ is at least $a_{v,L}$, that the communication rate $R$ is less than the capacity $C$ with codeword length $n = (1/R)aL\log L$, and that we have an approximate least squares estimator. For $\ell_0$ between 1 and $L$, the probability $\mathbb{P}[\text{mistakes} \ge \ell_0]$ is bounded by the sum over integers $\ell$ from $\ell_0$ to $L$ of $\mathbb{P}[E_\ell]$ using the minimum of the bounds from Lemmas 1 and 2. It follows that there is a positive constant $c$, such that for all $\alpha_0$ between 0 and 1,

$$\mathbb{P}[\text{mistakes} \ge \alpha_0 L] \le 2L\exp\{-nc\min\{\alpha_0, g(C-R)\}\}.$$

Consequently, asymptotically, taking $\alpha_0$ of the order of a constant times $1/L$, the fraction of mistakes is of order $1/L$ in probability, provided $C-R$ is at least a constant multiple of $1/\sqrt{L}$. Moreover, for any fixed $\alpha_0$, $a$, and $R$, not depending on $L$, satisfying $\alpha_0 > 0$, $a > a_v$ and $R < C$, we conclude that this probability is exponentially small.

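Proof: Consider the exponent $D_{\alpha,v} = D_1(\Delta_\alpha, 1-\rho_\alpha^2)$ as given at the start of the preceding section. We take a reference $\Delta_\alpha^{\mathrm{ref}}$ for which $\Delta_\alpha \ge \Delta_\alpha^{\mathrm{ref}}$ and for which $\Delta_\alpha^{\mathrm{ref}}$ is at least $\Delta_\alpha^{\min}$ and at least a multiple of $\tilde\Delta_\alpha$.

The simplest choice is $\Delta_\alpha^{\mathrm{ref}} = \tilde\Delta_\alpha$, which may be used when $t$ is less than a fixed fraction of $\alpha_0(C-R)$. Then $\Delta_\alpha = \tilde\Delta_\alpha + \alpha(C-R) - t_\alpha$ exceeds $\tilde\Delta_\alpha$, taking $t_\alpha$ to be between $t$ and $\alpha(C-R)$. Small precision $t$ makes for a greater computational challenge. Allowance is made for a more relaxed requirement that $t$ be less than $\min_{\alpha_0\le\alpha\le 1}\{\alpha(C-R) + (1/2)\tilde\Delta_\alpha\}$ and less than a fixed fraction of $\min_{\alpha_0\le\alpha\le 1}\{\alpha(C-R) + \tilde\Delta_\alpha - \Delta_\alpha^{\min}\}$. Both of these conditions are satisfied when $t$ is less than the value $(1/2)\min_{\alpha\ge\alpha_0}\{\alpha(C-R) + (\tilde\Delta_\alpha - \Delta_\alpha^{\min})\}$ stated for the theorem.

Accordingly, set $\Delta_\alpha^{\mathrm{ref}} = (1/2)[\tilde\Delta_\alpha + \Delta_\alpha^{\min}]$ to be half way between $\Delta_\alpha^{\min}$ and $\tilde\Delta_\alpha$. With $t$ less than both $[\alpha(C-R) + \tilde\Delta_\alpha - \Delta_\alpha^{\min}]$ and $[\alpha(C-R) + (1/2)\tilde\Delta_\alpha]$, arrange $t_\alpha > t$ to be less than both of these as well. For then $\Delta_\alpha^{\mathrm{ref}}$ exceeds both $\Delta_\alpha^{\min}$ and $(1/4)\tilde\Delta_\alpha$ as required.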

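Now $D_1(\Delta, 1-\rho^2)$ has a nondecreasing derivative with respect to $\Delta$. So $D_{\alpha,v} = D_1(\Delta_\alpha, 1-\rho_\alpha^2)$ is greater than $D_\alpha^{\mathrm{ref}} = D_1(\Delta_\alpha^{\mathrm{ref}}, 1-\rho_\alpha^2)$. Consequently, it lies above the tangent line (the first order Taylor expansion) at $\Delta_\alpha^{\mathrm{ref}}$, that is,

$$D_{\alpha,v} \ge D_\alpha^{\mathrm{ref}} + (\Delta_\alpha - \Delta_\alpha^{\mathrm{ref}})\,D',$$

where $D' = D_1'(\Delta)$ is the derivative of $D_1(\Delta) = D_1(\Delta, 1-\rho_\alpha^2)$ with respect to $\Delta$, which is here evaluated at $\Delta_\alpha^{\mathrm{ref}}$. In detail, the derivative $D_1'(\Delta)$ is seen to equal

$$\frac{1}{1-\rho_\alpha^2}\cdot\frac{2\Delta}{1+\sqrt{1+4\Delta^2/(1-\rho_\alpha^2)}}$$

when $\Delta < (1-\rho_\alpha^2)/\rho_\alpha^2$, and this derivative is equal to 1 otherwise. [The latter case with derivative equal to 1 includes the situations $\alpha = 0$ and $\alpha = 1$ where $1-\rho_\alpha^2 = 0$ with $D_1 = \Delta$; all other $\alpha$ have $1-\rho_\alpha^2 > 0$.]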
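The derivative formula can be checked against a finite difference of $D_1$. This is a sketch with hypothetical function names, natural logarithms, and `s` standing for $1-\rho_\alpha^2$; the derivative with respect to $\Delta$ coincides with the optimizing $\lambda$, namely $\min\{1, \lambda^*\}$.

```python
import math

def D1(delta, s):
    """Exponent with lambda restricted to [0, 1]; s = 1 - rho^2, natural log."""
    if s == 0.0 or delta >= s / (1.0 - s):
        return delta + 0.5 * math.log(1.0 - s)
    g = math.sqrt(1.0 + 4.0 * delta * delta / s) - 1.0
    return 0.5 * (g - math.log(1.0 + g / 2.0))

def D1_prime(delta, s):
    """Derivative of D1 with respect to Delta; equals the optimal lambda."""
    if s == 0.0 or delta >= s / (1.0 - s):
        return 1.0
    return (1.0 / s) * 2.0 * delta / (1.0 + math.sqrt(1.0 + 4.0 * delta * delta / s))

def finite_difference(delta, s, h=1e-6):
    """Central difference approximation to the derivative of D1 in Delta."""
    return (D1(delta + h, s) - D1(delta - h, s)) / (2.0 * h)
```

For `s = 0.5` the formula gives about 0.519 at `delta = 0.3`, reaches 1 at the boundary `delta = 1`, and stays at 1 beyond it, consistent with a nondecreasing derivative.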

Now lower bound the components of this tangent line. First lower bound the derivative $D' = D_1'(\Delta)$ evaluated at $\Delta = \Delta_\alpha^{\mathrm{ref}}$. Since this derivative is non-decreasing it is at least as large as the value at $\Delta = (1/4)\tilde\Delta_\alpha$. As in our developments in previous sections $\tilde\Delta_\alpha^2/(1-\rho_\alpha^2)$ is a bounded function of $\alpha$. Moreover, $\tilde\Delta_\alpha$ and $1-\rho_\alpha^2$ are positive functions of order $\alpha(1-\alpha)$ in the unit interval, with ratio tending to positive values as $\alpha$ tends to 0 and 1, so their ratio is uniformly bounded away from 0. Consequently $w_v = \min_\alpha D_1'(\Delta_\alpha^{\mathrm{ref}})$ is strictly positive. [This is where we have taken advantage of $\Delta_\alpha^{\mathrm{ref}}$ being at least a multiple of $\tilde\Delta_\alpha$; if instead we used $\Delta_\alpha^{\min}$ as the reference, then for some $\alpha$ we would find the $D_1'(\Delta_\alpha^{\min})$ being of order $1/\sqrt{\log L}$, producing a slightly inferior order in the exponent of the probability bound.]

Next examine $D_\alpha^{\mathrm{ref}}$. Since $\Delta_\alpha^{\mathrm{ref}}$ is at least $\Delta_\alpha^{\min}$, it follows that $D_\alpha^{\mathrm{ref}}$ is at least $D_\alpha^{\min} = D_1(\Delta_\alpha^{\min}, 1-\rho_\alpha^2)$.

Now we are in position to apply Lemma 2 and Lemma 4. If the section size rate $a$ is at least $a_{v,L}$ we have that $nD_\alpha^{\min}$ cancels the combinatorial coefficient and hence the first term in the $\mathbb{P}[E_\ell]$ bound (the part controlling $\mathbb{P}[\tilde E_\ell]$) is not more than

$$\exp\{-n[\Delta_\alpha - \Delta_\alpha^{\mathrm{ref}}]D'\},$$

where $\alpha = \ell/L$. In the first case, with $t < \alpha(C-R)$ and $\Delta_\alpha^{\mathrm{ref}} = \tilde\Delta_\alpha$, this yields $\mathbb{P}[E_\ell]$ not more than the sum of

$$\exp\{-n[\alpha(C-R) - t_\alpha]D'\}$$

and

$$\exp\{-nD(t_\alpha - t,\, \alpha^2 v/(1+\alpha^2 v))\},$$

for any choice of $t_\alpha$ between $t$ and $\alpha(C-R)$. For instance one may choose $t_\alpha$ to be half way between $t$ and $\alpha(C-R)$.

Now if $t$ is less than a fixed fraction of $\alpha_0(C-R)$, we have arranged for both $\alpha(C-R) - t_\alpha$ and $t_\alpha - t$ to be of order $\alpha(C-R)$ uniformly for $\alpha \ge \alpha_0$.

Accordingly, the first of the two parts in the bound has exponent exceeding a quantity of order $\alpha_0(C-R)$. The second of the two parts has exponent related to a function of the ratio $u = (\alpha(C-R))^2/[\alpha^2 v/(1+\alpha^2 v)]$ as explained in Section II, where the function is of order $u$ for small $u$ and order $\sqrt{u}$ for large $u$. Here $u$ is of order $(C-R)^2$ uniformly in $\alpha$. It follows that there is a constant $c$ (depending on $v$) such that

$$\mathbb{P}[E_\ell] \le 2\exp\{-nc\min\{\alpha_0(C-R),\, g(C-R)\}\}.$$

An improved bound is obtained, along with allowance of a larger threshold $t$, using $\Delta_\alpha^{\mathrm{ref}}$ half way between $\Delta_\alpha^{\min}$ and $\tilde\Delta_\alpha$. Then the first part of the bound becomes

$$\exp\{-n(1/2)[\alpha(C-R) + (\tilde\Delta_\alpha - \Delta_\alpha^{\min}) - t_\alpha]D'\}$$

provided $t_\alpha$ is chosen between $t$ and $\alpha(C-R) + (\tilde\Delta_\alpha - \Delta_\alpha^{\min})$, e.g. half way between works for our purposes. This bound is superior to the previous one, when $R$ closely matches $C$, because of the addition of the non-negative $(\tilde\Delta_\alpha - \Delta_\alpha^{\min})$ term. For $\alpha$ less than, say, 1/2, we use that the exponent exceeds a fixed multiple of $\alpha_0\tau_v w_v$; whereas for $\alpha \ge 1/2$ we use that the exponent exceeds a fixed multiple of $[C-R]w_v$. For $R < C$, it yields the desired bounds on $\mathbb{P}[E_\ell]$, uniformly exponentially small for $\alpha \ge \alpha_0$, with the stated conditions on $t$.

With optimized $t_\alpha$, let $D_{\min,\alpha,t}$ be the minimum of the
two exponents from the two terms in the bound on $P[E_\ell]$ at
$\alpha = \ell/L$. Likewise, let $D_{\min} = D_{\min,t}$ be the minimum
of these exponents for $\ell \ge \alpha_0 L$. We have established that
$D_{\min}$ exceeds a quantity of order $\min\{\alpha_0, g(C-R)\}$. Then
for $\ell \ge \alpha_0 L$,

$P[E_\ell] \le 2e^{-nD_{\min}}$

and accordingly

$P[\text{mistakes} \ge \alpha_0 L] \le 2L e^{-nD_{\min}}$.

Using the form of the constants identified above, we see
that even for $\alpha_0$ of order $1/L$, that is, for $\ell_0 = \alpha_0 L$
constant, the probability $P[\text{mistakes} \ge \ell_0]$ goes to zero
polynomially in $1/L$. Indeed, for $C - R$ at least a multiple
of $1/\sqrt{L}$, and sufficiently small $t$, the bound becomes
$2L\exp\{-n(1/2)\tau_v w_v \ell_0/L\}$ which with $n = (a/R)L\log L$
becomes

$P[\text{mistakes} \ge \ell_0] \le 2(1/L)^{(1/2)(a/R)\tau_v w_v \ell_0 - 1}$.

It is assured to go to zero with $L$ for $\ell_0$ at least $2C/[a_v \tau_v w_v]$.
This completes the proof of Theorem 5.
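
As a numerical illustration of the last substitution, the following Python sketch evaluates the bound $2L\exp\{-n(1/2)\tau_v w_v \ell_0/L\}$ directly and through the closed form $2(1/L)^{(1/2)(a/R)\tau_v w_v \ell_0 - 1}$ obtained with $n = (a/R)L\log L$. The function name and the numerical value supplied for the product $\tau_v w_v$ are hypothetical, chosen only to exercise the formula; they are not constants from the analysis.

```python
import math

def polynomial_bound(L, a, R, tau_w, ell0):
    """Evaluate the mistake bound two ways.

    tau_w stands for the product tau_v * w_v (value assumed by the caller).
    Returns (direct evaluation, closed form in powers of 1/L).
    """
    n = (a / R) * L * math.log(L)                       # codelength n = (a/R) L log L
    direct = 2 * L * math.exp(-n * 0.5 * tau_w * ell0 / L)
    closed = 2 * (1.0 / L) ** (0.5 * (a / R) * tau_w * ell0 - 1)
    return direct, closed

# Hypothetical inputs: the exponent (1/2)(a/R) tau_w ell0 - 1 equals 1 here,
# so the bound decays like 2/L.
direct, closed = polynomial_bound(L=1024, a=2.0, R=1.5, tau_w=0.3, ell0=10)
print(direct, closed)
```

The bound tends to zero with $L$ exactly when the exponent $(1/2)(a/R)\tau_v w_v \ell_0 - 1$ is positive, which is the condition on $\ell_0$ stated above.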

Remarks: For a range of values of $\ell_0$, up to the point where a
multiple of $\ell_0/L$ hits $g(C-R)$, the upper tail of the distribution
of the number of mistakes past a minimal value is shown
to be less than that of a geometric random variable. Using
the geometric sum, an alternative to the factor $L$ outside the
exponent can be arranged.

The form given for the exponential bound is meant only to 
reveal the general character of what is available. In particular, 
via appeal to the section size analysis, we ensure that the
combinatorial coefficient is canceled and yet, for $R < C$, that there is
enough additional exponent that the probability of a fraction of
at least $\alpha_0$ mistakes is exponentially small. A compromise was
made, by introduction of an inequality (the tangent bound on
the exponent) to proceed most simply to this demonstration. 
Now understanding that it is exponentially small, our best 
evaluation avoids this compromise and proceeds directly, using 
for each a the best of the bounds from Lemma 1 and Lemma 
2, as it provides substantial numerical improvement. 

The polynomial bound on more than a constant number
of mistakes is here extracted as an aside to the exponential
bound with exponent proportional to $\ell$. One can conclude, for
sufficient section size rate $a$, using $\ell_0 = 1$, that the probability
of even 1 or more mistakes is polynomially small. Polynomially
small block error probability is not as impressive when by a
simple device it is made considerably better. Indeed, we have
established smaller probability bounds with larger mistake
thresholds $\ell_0$. With certain such thresholds, fewer mistakes
than that are guaranteed correctable by suitable outer codes,
thereby yielding smaller overall block error probability.

The probability of the error event $E = \{\text{mistakes} \ge \alpha_0 L\}$
has been computed averaging over random generation of 
the dictionary X as well as the distribution of the received 
sequence Y. In this case the bounds apply equally to an 
individual input u as well as with the uniform distribution on 
the ensemble of possible inputs. Implications of the bounds 
for a randomly generated dictionary X are discussed further 
in Appendix A. 

In the next section we review basic properties of Reed-Solomon
codes and discuss their role in correcting any remaining
section errors.

VI. From Small Fraction of Mistakes to Small 
Probability of Any Mistake 

We employ Reed-Solomon (RS) codes (ETJ, as an 

outer code for correcting any remaining section mistakes. The
symbols for the RS code come from a Galois field consisting
of $q$ elements denoted by $GF(q)$, with $q$ typically taken to
be of the form $2^m$. If $K_{out}$, $n_{out}$ represent message and
codeword lengths respectively, then an RS code with symbols
in $GF(2^m)$ and minimum distance between codewords given
by $d_{RS}$ can have the following parameters:

$n_{out} = 2^m$

$n_{out} - K_{out} = d_{RS} - 1$

Here $n_{out} - K_{out}$ gives the number of parity check symbols
added to the message to form the codeword. In what follows
we find it convenient to take $B$ to be equal to $2^m$ so that one can
view each symbol in $GF(2^m)$ as giving a number between 1
and $B$.

We now demonstrate how the RS code can be used as an
outer code in conjunction with our inner superposition code,
to achieve low block error probability. For simplicity assume
that $B$ is a power of 2. First consider the case when $L$ equals
$B$. Taking $m = \log_2 B$, we have that since $L$ is equal to
$B$, the RS codelength becomes $L$. Thus, one can view each
symbol as representing an index in each of the $L$ sections.
The number of input symbols is then $K_{out} = L - d_{RS} + 1$, so
setting $\delta = d_{RS}/L$, one sees that the outer rate $R_{out} = K_{out}/L$ equals
$1 - \delta + 1/L$ which is at least $1 - \delta$.

For code composition, $K_{out}\log_2 B$ message bits become the
$K_{out}$ input symbols to the outer code. The symbols of the
outer codeword, having length $L$, give the labels of terms
sent from each section using our inner superposition code with
codelength $n = L\log_2 B / R_{inner}$. From the received $Y$ the
estimated labels $\hat{j}_1, \hat{j}_2, \ldots, \hat{j}_L$ using our least squares decoder
can be again thought of as output symbols for our RS codes. If
$\hat{\delta}_e$ denotes the section mistake rate, it follows from the distance
property of the outer code that if $2\hat{\delta}_e < \delta$ then these errors can
be corrected. The overall rate $R_{comp}$ is seen to be equal to the
product of rates $R_{out}R_{inner}$ which is at least $(1 - \delta)R_{inner}$.

Since we arrange for $\hat{\delta}_e$ to be smaller than some $\alpha_0$ except in
an event of exponentially small probability, it follows from the above that
composition with an outer code allows us to communicate with
the same reliability, albeit with a slightly smaller rate given
by $(1 - 2\alpha_0)R_{inner}$.
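
The bookkeeping of the composition can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function name and the example inputs are hypothetical, the distance is taken as the smallest $d_{RS}$ that corrects $\lfloor \alpha_0 L \rfloor$ section mistakes, and the codelength $L\log_2 B/R_{inner}$ is reported as a nominal real number without rounding. No RS encoder or decoder is implemented.

```python
import math

def outer_code_parameters(L, B, alpha0, r_inner):
    """Parameters of the composite code when an RS outer code of length L
    over GF(B) protects the L section labels (requires L <= B)."""
    m = round(math.log2(B))
    if 2 ** m != B or L > B:
        raise ValueError("need B a power of 2 and L <= B")
    max_mistakes = math.floor(alpha0 * L)   # section mistakes to be corrected
    d_rs = 2 * max_mistakes + 1             # smallest distance with 2*mistakes < d_rs
    k_out = L - d_rs + 1                    # number of RS message symbols
    r_out = k_out / L                       # outer rate
    return {
        "shortened_by": B - L,              # w = B - L symbols removed when L < B
        "d_rs": d_rs,
        "k_out": k_out,
        "r_out": r_out,
        "message_bits": k_out * m,          # K_out * log2(B)
        "inner_codelength": L * m / r_inner,
        "r_comp": r_out * r_inner,          # overall rate R_out * R_inner
    }

# Hypothetical example values.
params = outer_code_parameters(L=64, B=256, alpha0=0.05, r_inner=1.5)
print(params)
```

With this choice of $d_{RS}$ the outer rate is $1 - 2\lfloor\alpha_0 L\rfloor/L$, which is at least $1 - 2\alpha_0$, in agreement with the composite rate quoted above.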

The case when $L < B$ can be dealt with by observing
([36], Page 240) that an $(n_{out}, K_{out})$ RS code as above can
be shortened by length $w$, where $0 < w < K_{out}$, to form an
$(n_{out} - w, K_{out} - w)$ code with the same minimum distance
$d_{RS}$ as before. This is easily seen by viewing each codeword as
being created by appending $n_{out} - K_{out}$ parity check symbols
to the end of the corresponding message string. Then the code
formed by considering the set of codewords with the $w$ leading
symbols identical to zero has precisely the properties stated
above.

With $B$ equal to $2^m$ as before, we have $n_{out}$ equals $B$,
so taking $w$ to be $B - L$ we get an $(n'_{out}, K'_{out})$ code, with
$n'_{out} = L$, $K'_{out} = L - d_{RS} + 1$ and minimum distance $d_{RS}$.
Now since the codelength is $L$ and symbols of this code are
in $GF(B)$ the code composition can be carried out as before.

We summarize the above in the following. 

Proposition 6: To obtain a code with small block error
probability it is enough to have demonstrated a partitioned
superposition code for which the section error rate is small
with high probability. In particular, for any given positive $\epsilon$
and $\alpha_0$, let $R$ be a rate for which the partitioned superposition
code with $L$ sections has

$\mathrm{Prob}\{\#\text{ section mistakes} > \alpha_0 L\} \le \epsilon$.

Then through concatenation of such a code with an outer Reed-Solomon
code, one obtains a composite code for which the rate
is $(1 - 2\alpha_0)R$ and the block error probability is less than or
equal to $\epsilon$.

Appendix A: Implications for random dictionaries 

Here we provide discussion of the implications of our
error probability bound of Section V for randomly generated
dictionaries $X$.

The probability of the error event $E = \{\text{mistakes} \ge \alpha_0 L\}$
has been computed averaging over random generation of
the dictionary $X$ as well as the distribution of the received
sequence $Y$. Let's denote the given bound $P_e^*$. The theorem
asserts that this bound is exponentially small. For instance, it
is less than $2Le^{-nD_{\min}}$.

The same bound holds for any given $K$ bit input sequence $u$.
Indeed, the probability of $E$ given that $u$ is sent, which we may
write as $P[E|u]$, is the same for all $u$ by exchangeability of the
distribution of the columns of $X$. Accordingly, it also matches
the average probability $P[E] = \frac{1}{2^K}\sum_u P[E|u]$, averaging over
all possible inputs, so this average probability will have the
same bound.

Reversing the order of the average over $u$ and the average
over the choice of dictionary $X$, the average probability may
be written $E_X[\frac{1}{2^K}\sum_u P[E|u,X]]$, where $P[E|u,X]$ denotes
the probability of the error event $E$, conditioning on the event
that the input is $u$ and that the dictionary is $X$ (the only
remaining average in $P[E|u,X]$ is over the distribution of the
noise). This $P[E|u,X]$ will vary with $u$ as well as with $X$.
An appropriate target performance measure is

$P[E|X] = \frac{1}{2^K}\sum_u P[E|u,X]$,

the probability of the error event, averaged with respect to
the input, conditional on the random dictionary $X$. Since the
expectation $P[E] = E_X[P[E|X]]$ satisfies the indicated bound,
random $X$ are likely to behave similarly. Indeed, by Markov's
inequality $P[P[E|X] \ge \tau P_e^*] \le 1/\tau$.

So with a single draw of the dictionary $X$, it will satisfy
$P[E|X] \le \tau P_e^*$, with probability at least $1 - 1/\tau$. The
manageable size of the dictionary facilitates computational
verification by simulation that the bound holds for that $X$.
With $\tau = 2$ one may independently repeat the generation of
$X$ a Geometric(1/2) number of times until success. The mean
number of draws of the dictionary required for one with the
desired performance level is 2. Even with only one draw of
$X$, one has with $\tau = e^{(n/2)D_{\min}}$, that

$P[E|X] \le 2Le^{-(n/2)D_{\min}}$,

except for $X$ in an event of probability not more than $e^{-(n/2)D_{\min}}$.
Now $P[E|X]$ exponentially small implies that $P[E|u,X]$ is
exponentially small for most $u$ (again by Markov's inequality).
In theory one could expurgate the codebook, leaving only
good performing $\beta$ and reassigning the mapping from $u$ to
$\beta$, to remove the minority of cases in which $P[E|u,X] >
4Le^{-(n/2)D_{\min}}$. Thereby one would have uniformly exponentially
small error probability.
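
The Markov-inequality steps above (a single draw of $X$, repeated draws with $\tau = 2$, and the choice $\tau = e^{(n/2)D_{\min}}$) can be tabulated with the following Python sketch. The function names and the numerical value used for $D_{\min}$ are hypothetical; the redraw simulation assumes the worst case in which each draw succeeds with probability exactly 1/2.

```python
import math
import random

def error_bound(L, n, d_min):
    """The averaged bound 2 L exp(-n D_min)."""
    return 2 * L * math.exp(-n * d_min)

def single_draw_guarantee(L, n, d_min, tau):
    """Markov: P[E|X] <= tau * bound, except for X in an event
    of probability at most 1/tau. Returns (threshold, failure probability)."""
    return tau * error_bound(L, n, d_min), 1.0 / tau

def draws_until_good(rng, success_prob=0.5):
    """Number of independent dictionary draws until one meets the threshold,
    when each draw succeeds with probability success_prob."""
    draws = 1
    while rng.random() >= success_prob:
        draws += 1
    return draws

# Hypothetical values of L, n and D_min, for illustration only.
L, n, d_min = 64, 256, 0.05
tau = math.exp(0.5 * n * d_min)
threshold, failure = single_draw_guarantee(L, n, d_min, tau)
print(threshold, failure)    # 2 L exp(-(n/2) D_min) and exp(-(n/2) D_min)

rng = random.Random(0)
mean_draws = sum(draws_until_good(rng) for _ in range(20000)) / 20000
print(mean_draws)            # close to 2 for success probability 1/2
```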

In principle, simulations can be used to evaluate $P[E|u,X]$
for a specific $\beta$ and $X$, to decide whether that $\beta$ should be
used. However, it is not practical to do so in advance for all
$\beta$, and it is not apparent how to perform such expurgations
efficiently on-line during communications. Thus we maintain
our focus in this paper on average case error probability,
averaging over the possible inputs, rather than maximal error
probability.

As we have said, for the average case analysis, armed with a
suitable decoder, one can check, for a dictionary $X$, whether
it satisfies an exponential bound on $P[E|X]$ empirically by
simulating a number of draws of the input and of the noise.
Nevertheless, it would be nice to have a more direct,
non-sampling check that a dictionary $X$ satisfies the requirement for
such a bound on $P[E|X]$. Our current method of proof does
not facilitate providing such a direct check. The reason is that
our analysis does not exclusively use the distribution of $Y$
given $u$ and $X$; rather it makes critical use of properties of
the joint distribution of $Y$ and $X$ given $u$.

Likewise, averaging over the random generation of the
dictionary permits a simple look at the satisfaction of the
average power constraints. With a randomly drawn $u$, and
associated coefficient vector $\beta = \beta(u)$, consider the behavior
of the power $|X\beta|^2$ and whether it stays less than $(1+\epsilon)P$. The
event $A^c = \{|X\beta|^2 \ge (1+\epsilon)P\}$, when conditioning on the input
$u$, has exponentially small probability $P[A^c|u]$, in accordance
with the normal distribution of the codeword obtained via the
distribution of the dictionary $X$. Again $P[A^c|u]$ is the same for
all $u$ and hence matches the average $P[A^c]$ with expectation
taken with respect to random input $u$ as well as with respect to
the distribution of $X$. So reversing the order of the expectation
we have that $E[P[A^c|X]]$ enjoys the exponential bound,
from which, again by applications of Markov's inequality,
except for $X$ in an event of exponentially small probability,
$|X\beta|^2 < (1+\epsilon)P$ for all but an exponentially small fraction
of coefficient vectors $\beta$ in $\mathcal{B}$.

Control of the average power is a case in which we can 
formulate a direct check of what is required of the dictionary 
X, as is examined in Appendix B. 

Appendix B: Codeword power 

Here we examine the average and maximal power of the 
codewords. The maximal power has a role in our analysis of 
decoding. 

The power of a codeword $c$ is its squared norm $|c|^2$, consisting
of the average square of the codeword values across its
$n$ coordinates. The terminology power arises from settings in
which codeword values are voltages on a communication wire
or a transmission antenna in the wireless case, recalling that
power equals average squared voltage divided by resistance.

Average power for the signed subset code: Consider first
our signed subset superposition code. Each input corresponds
to a coefficient vector $\beta = (\beta_j)_{j=1}^N$, where for each of the
$L$ sections there is only one $j$ for which $\beta_j$ is nonzero, and,
having absorbed the size of the terms into the $X_j$, the nonzero
coefficients are taken to be $\pm 1$. These are the coefficient
vectors $\beta$ of our codewords $c = X\beta$, for which the power
is $|c|^2 = |X\beta|^2$.

With a uniform distribution on the binary input sequence
of length $K = L\log(2B)$, the induced distribution on the
sequence of indices $j_i$ is independent uniform on the $B$ choices
in section $i$, and likewise the signs are independent uniform
$\pm 1$ valued, for $i = 1, 2, \ldots, L$. Fix a dictionary $X$, and
consider the average of the codeword powers with this uniform
distribution on inputs.

By independence across sections, this average simplifies to

$P_X = \sum_{i=1}^L \frac{1}{B}\sum_{j \in sec_i} |X_j|^2$.

Now we consider the size of this average power, using the
distribution of the dictionary $X$, with each entry independent
Normal$(0, P/L)$. This average power $P_X$ has mean $E P_X$ equal
to $P$, standard deviation $P\sqrt{2/(Nn)}$, and distribution equal
to $[P/(Nn)]\chi^2_d$, where $\chi^2_d$ is a Chi-square random variable
with $d = Nn$ degrees of freedom. Accordingly $P_X$ is very
close to $P$.

Indeed, in a random draw of the dictionary $X$, the chance
that $P_X$ exceeds $P + 2P\sqrt{\log(1/\epsilon)/(Nn)}$ is approximately
less than $\epsilon$, as can be seen via the Chernoff-Cramér bound
$P\{\chi^2_d > d + a\sqrt{2d}\} \le e^{-dD_2(a\sqrt{2/d})}$, for positive $a$, where
the exponent $D_2(\delta) = (1/2)[\delta - \log(1+\delta)]$ is near $\delta^2/4$ for
small positive $\delta$, so that the bound is near $e^{-a^2/2}$ which is $\epsilon$
for $a = \sqrt{2\log(1/\epsilon)}$.
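
The following Python sketch evaluates the exponent $D_2(\delta) = (1/2)[\delta - \log(1+\delta)]$ and the resulting Chernoff-Cramér tail bound for the threshold choice $a = \sqrt{2\log(1/\epsilon)}$, to show how close the bound is to $\epsilon$ when $d = Nn$ is large. The function names and the example values of $d$ and $\epsilon$ are illustrative assumptions.

```python
import math

def d2(delta):
    """Exponent D_2(delta) = (1/2)[delta - log(1 + delta)]."""
    return 0.5 * (delta - math.log1p(delta))

def chi_square_tail_bound(d, a):
    """Bound exp(-d D_2(a sqrt(2/d))) on P{chi2_d > d + a sqrt(2d)}."""
    return math.exp(-d * d2(a * math.sqrt(2.0 / d)))

def power_threshold(P, Nn, eps):
    """Threshold P + 2 P sqrt(log(1/eps) / (N n)) for the average power P_X."""
    return P + 2 * P * math.sqrt(math.log(1.0 / eps) / Nn)

# Illustrative values: d = N n as for L = 64, B = 256, n = 256.
d, eps = 64 * 256 * 256, 0.01
a = math.sqrt(2 * math.log(1.0 / eps))
print(chi_square_tail_bound(d, a))     # close to eps for large d
print(power_threshold(15.0, d, eps))   # slightly above P = 15
```

Because $D_2(\delta)$ falls slightly below $\delta^2/4$, the computed bound is slightly larger than $\epsilon$, which is why the text says "approximately less than".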

Or we may appeal to the normal approximation for fixed a 
when d = Nn is large; the probability is not more than 0.05 
that the dictionary has average power Px outside the interval 
formed by the mean plus or minus two standard deviations 

$P \pm 2P\sqrt{2/(Nn)}$.

For instance, suppose P = 15 and the rate is near the 
capacity $C = 2$, so that $Nn$ is near $(LB)(L\log B)/C$, and
pick L = 64 and B = 256. Then with high probability Px is 
not more than 1.001 times P. 
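
The arithmetic of this instance can be reproduced with a short Python sketch. It assumes unit noise variance, so that the capacity in bits is $(1/2)\log_2(1+P)$, and takes the logarithm of $B$ in base 2; the function name is hypothetical.

```python
import math

def average_power_spread(P, L, B):
    """Return (capacity in bits, N*n, relative standard deviation of P_X)
    for a rate equal to capacity, assuming unit noise variance."""
    C = 0.5 * math.log2(1 + P)       # capacity in bits per channel use
    n = L * math.log2(B) / C         # codelength when the rate equals C
    N = L * B                        # dictionary size
    rel_sd = math.sqrt(2.0 / (N * n))
    return C, N * n, rel_sd

C, Nn, rel_sd = average_power_spread(P=15, L=64, B=256)
print(C, Nn, rel_sd)
```

For these values $Nn$ is about 4.2 million and the relative standard deviation of $P_X$ is roughly 0.0007, so departures of $P_X$ from $P$ are on the order of a tenth of a percent.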

If the average power constraint is held stringently, with 
average power to be precisely not more than P, then in the 
design of the code proceed by generating the entries of X with 
power P'/L, where P' is less than P. The analysis of the 
preceding sections then carries through to show exponentially 
small probability of more than a small fraction of mistakes
when R < C as long as P' is sufficiently close to P. 

Average power for the subset code: Likewise, let's consider
the case of subset superposition coding without use
of the signs. Once again fix $X$ and consider a uniform
distribution on inputs; it again makes the term selections $j_i$
independent and uniformly distributed over the $B$ choices in
each section. Now there is a small, but non-zero, average
$\bar{X}_i = (1/B)\sum_{j \in sec_i} X_j$ of the terms in each section $i$,
and likewise a very small, but non-zero, overall average
$\bar{X} = (1/L)\sum_{i=1}^L \bar{X}_i$. We need to make adjustments by these
averages when invoking the section independence to compute
the average power. Indeed, as in the rule that an expected
square is the square of the expectation plus a variance, the
average power is the squared norm of the average of the
codewords plus the average norm squared difference between
codewords and their mean. The mean of the codewords, with
the uniform distribution on inputs, is $\sum_{i=1}^L \bar{X}_i = L\bar{X}$ which
is a Normal$(0, (P/B)I)$ random vector of length $n$.

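By independence of the term selections, the codeword
variance is $\sum_{i=1}^L (1/B)\sum_{j \in sec_i} |X_j - \bar{X}_i|^2$. Accordingly, in
this subset coding setting, the average power is

$P_X = |L\bar{X}|^2 + \sum_{i=1}^L \frac{1}{B}\sum_{j \in sec_i} |X_j - \bar{X}_i|^2$.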

Using the independence of $\bar{X}_i$ and $(X_j - \bar{X}_i : j \in sec_i)$
and standard distribution theory for sample variances, with a
randomly drawn dictionary $X$, we have that $P_X$ is $P/(LBn)$
times a Chi-square random variable with $nL(B-1)$ degrees
of freedom, plus $P/(nB)$ times an independent Chi-square
random variable with $n$ degrees of freedom. So it has mean
equal to $P$ and a standard deviation of $P\sqrt{2/n}\,\sqrt{\frac{1}{LB} + \frac{1}{B^2} - \frac{1}{LB^2}}$,
which is slightly greater than before. It again yields only a
small departure from the target average power $P$, as long as
$n$ and $B$ are large.
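
As a check on the mean and standard deviation just described, the Python sketch below computes them from the two independent Chi-square components (using that a Chi-square variable with $k$ degrees of freedom has mean $k$ and variance $2k$) and compares the result with the closed form $P\sqrt{2/n}\sqrt{1/(LB) + 1/B^2 - 1/(LB^2)}$ and with the signed-code value $P\sqrt{2/(LBn)}$. The function names and the example values are illustrative assumptions.

```python
import math

def subset_power_moments(P, L, B, n):
    """Mean and sd of P_X = c1 * chi2(n L (B-1)) + c2 * chi2(n),
    with c1 = P/(L B n) and c2 = P/(n B), the two terms independent."""
    c1, k1 = P / (L * B * n), n * L * (B - 1)
    c2, k2 = P / (n * B), n
    mean = c1 * k1 + c2 * k2
    var = 2 * k1 * c1 ** 2 + 2 * k2 * c2 ** 2
    return mean, math.sqrt(var)

def subset_sd_closed_form(P, L, B, n):
    return P * math.sqrt(2.0 / n) * math.sqrt(1 / (L * B) + 1 / B ** 2 - 1 / (L * B ** 2))

def signed_sd(P, L, B, n):
    """Standard deviation of P_X for the signed subset code."""
    return P * math.sqrt(2.0 / (L * B * n))

P, L, B, n = 15.0, 64, 256, 256
mean, sd = subset_power_moments(P, L, B, n)
print(mean, sd, subset_sd_closed_form(P, L, B, n), signed_sd(P, L, B, n))
```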

Worst case power: Next we consider the matter of the size of
the maximum power $P^{\max} = \max_\beta |X\beta|^2$ among codewords
for a given design $X$. The simplest distribution bound is to
note that for each $\beta$, the codeword $X\beta$ is distributed as a
random vector with independent Normal$(0, P)$ coordinates, for
which $|X\beta|^2$ is $P/n$ times a Chi-square $n$ random variable.
There are $e^{nR}$ such codewords, with the rate written in nats.
We recall the probability bound $P\{\chi^2_n > n(1+\delta)\} \le e^{-nD_2(\delta)}$.
Accordingly, by the union bound, $P^{\max}$ is not more than

$P + P\,G_2(R + \frac{1}{n}\log(1/\epsilon))$,

except in an event of probability which we bound by
$e^{nR}e^{-nD_2(G_2(R + (\log 1/\epsilon)/n))} = \epsilon$, where $G_2$ is the inverse of
the function $D_2(\delta) = (1/2)[\delta - \log(1+\delta)]$. This $G_2(r)$ is seen
to be of order $2\sqrt{r}$ for small positive $r$ and of order $2r$ for
large $r$. Consequently, the bound on the maximum power is
near $P + P\,G_2(R)$ rather than $P$.
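
Since $G_2$ has no closed form, a minimal Python sketch of its numerical evaluation is given here, inverting $D_2(\delta) = (1/2)[\delta - \log(1+\delta)]$ by bisection and then forming the bound $P + P\,G_2(R + (1/n)\log(1/\epsilon))$. The function names, the bisection scheme and the example values are illustrative assumptions.

```python
import math

def d2(delta):
    """D_2(delta) = (1/2)[delta - log(1 + delta)], increasing for delta > 0."""
    return 0.5 * (delta - math.log1p(delta))

def g2(r):
    """Inverse of D_2 on the positive axis, computed by bisection."""
    lo, hi = 0.0, 1.0
    while d2(hi) < r:            # grow the bracket until D_2(hi) >= r
        hi *= 2.0
    for _ in range(200):         # halve the bracket to machine precision
        mid = 0.5 * (lo + hi)
        if d2(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_power_bound(P, R_nats, n, eps):
    """P + P G_2(R + log(1/eps)/n), with the rate R in nats."""
    return P * (1.0 + g2(R_nats + math.log(1.0 / eps) / n))

print(g2(1e-6), 2 * math.sqrt(1e-6))    # small r: G_2(r) is near 2 sqrt(r)
print(g2(1000.0) / 2000.0)              # large r: G_2(r) is of order 2 r
print(max_power_bound(P=15.0, R_nats=1.0, n=256, eps=0.01))
```

For a positive rate the factor $1 + G_2(R)$ stays bounded away from 1 however large $n$ is, which is the point of the comparison with $P$.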

According to this characterization, for positive rate
communication, with subset superpositions, one can not rely, either in
encoding or in decoding, on the norms $|X\beta|^2$ being uniformly
close to their expectation.

Individual codeword power: We return to signed subset
coding and provide explicitly verifiable conditions on $X$ such
that for every subset, the power $|X\beta|^2$ is near $P$ for most
choices of signs. The uniform distribution on choices of signs
ameliorates between-section interference to produce simplified
analysis of codeword power.

The input specifies the term $j_i$ in each section along with
the choice of its sign given by $\mathrm{sign}_i$ in $\{-1, +1\}$, leading to
coefficient vectors $\beta$ equal to $\mathrm{sign}_i$ at position $j_i$ in section $i$,
for $i = 1, 2, \ldots, L$. The uniform distribution on the choices
of signs leads to them being independent, equiprobable $+1$
and $-1$.

Now the codeword is given by $X\beta = \sum_{i=1}^L \mathrm{sign}_i X_{j_i}$. It
has the property that conditional on $X$ and the subset $S =
\{j_i : i = 1, 2, \ldots, L\}$, the contributions $\mathrm{sign}_i X_{j_i}$ for distinct
sections are made to be mean zero uncorrelated vectors by the
random choice of signs. In particular, again conditioning on
the dictionary $X$ and the subset $S$, we have that the power
$|X\beta|^2$ has conditional mean

$P_{X,S} = \sum_{i=1}^L |X_{j_i}|^2$,

which we shall see is close to $P$. The deviation from the conditional
mean $|X\beta|^2 - P_{X,S}$ equals $\sum_{i \ne i'} \mathrm{sign}_i\,\mathrm{sign}_{i'}\,X_{j_i} \cdot X_{j_{i'}}$.
The presence of the random signs approximately symmetrizes
the conditional distribution and leads to conditional variance
$2\sum_{i \ne i'} (X_{j_i} \cdot X_{j_{i'}})^2$.
Now concerning the columns of the dictionary, the squared
norms $|X_j|^2$ are uniformly close to $P/L$, since the number of
such $N = LB$ is not exponentially large. Indeed, by the union
bound the maximum over the $N$ columns satisfies

$\max_j |X_j|^2 \le \frac{P}{L} + \frac{P}{L}\,G_2(\frac{1}{n}\log(N/\epsilon))$,

except in an event of probability bounded by $\epsilon$.

Whence the conditional mean power $P_{X,S}$ is not more than

$P + P\,G_2(\frac{1}{n}\log(N/\epsilon))$,

uniformly over all allowed selections of $L$ term subsets.
Note here that the polynomial size of $N = LB$ makes the
$(\log N)/n$ small; this is in contrast to the worst case analysis
above where the log cardinality divided by $n$ is the fixed rate
$R$.

Next, to show that the conditional mean captures the typical
power, we show that the conditional variance is small.
Toward that end we examine the inner products $X_j \cdot X_{j'}$ and
their maximum absolute value $\max_{j<j'} |X_j \cdot X_{j'}|$. Consider
products of independent standard normals $Z_1 Z_2$. These have
moment generating function $E e^{\lambda Z_1 Z_2}$ equal to $1/(1 - \lambda^2)^{1/2}$.
[This matches the moment generating function for half the
difference in squares of independent normals found in Section
2; to see why note that $Z_1 Z_2$ equals half the difference in
squares of $(Z_1 + Z_2)/\sqrt{2}$ and $(Z_1 - Z_2)/\sqrt{2}$.]

Accordingly $P\{X_j \cdot X_{j'} > (P/L)\Delta\} \le e^{-nD(\Delta)}$, for positive
$\Delta$, where $D(\Delta) = D(\Delta, 1)$. As previously discussed, this
$D(\Delta)$ is near $\Delta^2/2$ for small $\Delta$ and accordingly its inverse
function $G(r)$ is near $\sqrt{2r}$ for small $r$. The corresponding
two-sided bound is $P\{|X_j \cdot X_{j'}| > (P/L)\Delta\} \le 2e^{-nD(\Delta)}$.
By the union bound, we have that

$\max_{j<j'} |X_j \cdot X_{j'}| \le \frac{P}{L}\,G(\frac{1}{n}\log(N^2/\epsilon))$,

except for dictionaries $X$ in an event of probability not more
than $\epsilon$.
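
To make the exponent concrete, the Python sketch below takes $D(\Delta)$ to be the Chernoff exponent $\max_{0<\lambda<1}\{\lambda\Delta + (1/2)\log(1-\lambda^2)\}$ implied by the moment generating function $1/(1-\lambda^2)^{1/2}$ stated above. That identification with $D(\Delta, 1)$ is an assumption of this sketch, since the general definition is given outside this section. Setting the derivative to zero gives the maximizer $\lambda = (\sqrt{1+4\Delta^2} - 1)/(2\Delta)$. The function names and example values are hypothetical.

```python
import math

def d_exponent(delta):
    """Chernoff exponent max over lambda of lambda*delta + (1/2) log(1 - lambda^2)."""
    if delta == 0:
        return 0.0
    lam = (math.sqrt(1 + 4 * delta ** 2) - 1) / (2 * delta)   # stationary point
    return lam * delta + 0.5 * math.log(1 - lam ** 2)

def g_inverse(r):
    """Inverse of d_exponent on the positive axis, by bisection (moderate r)."""
    lo, hi = 0.0, 1.0
    while d_exponent(hi) < r:
        hi *= 2.0
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if d_exponent(mid) < r:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def max_inner_product_bound(P, L, B, n, eps):
    """(P/L) G((1/n) log(N^2/eps)) with N = L B."""
    N = L * B
    return (P / L) * g_inverse(math.log(N ** 2 / eps) / n)

print(d_exponent(0.01), 0.01 ** 2 / 2)      # near Delta^2 / 2 for small Delta
print(g_inverse(1e-4), math.sqrt(2e-4))     # near sqrt(2 r) for small r
print(max_inner_product_bound(P=15.0, L=64, B=256, n=256, eps=0.01))
```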

Recall that the conditional variance of $|X\beta|^2$ equals
$2\sum_{i \ne i'} (X_{j_i} \cdot X_{j_{i'}})^2$. In the likely event that the above bound
holds, we have that this conditional variance is not more than
$2P^2 G^2(\frac{1}{n}\log(N^2/\epsilon))$. Consequently, the conditional distribution
of the power $|X\beta|^2$ given $X$ and $S$ is indeed concentrated
near $P$.
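
The conditional mean and variance under random signs can be verified exactly on a small example by enumerating all $2^L$ sign patterns, as in the Python sketch below. It uses the normalized conventions $|x|^2 = (1/n)\sum_k x_k^2$ and $x \cdot y = (1/n)\sum_k x_k y_k$, and checks the mean $\sum_i |X_{j_i}|^2$ and the variance $2\sum_{i \ne i'}(X_{j_i} \cdot X_{j_{i'}})^2$, the form consistent with the $2P^2G^2$ bound. The function names and the small random columns are illustrative assumptions.

```python
import itertools
import random

def inner(x, y):
    """Normalized inner product (1/n) sum_k x_k y_k."""
    return sum(a * b for a, b in zip(x, y)) / len(x)

def enumerated_moments(cols):
    """Exact mean and variance of |sum_i sign_i cols[i]|^2 over all sign patterns."""
    n = len(cols[0])
    powers = []
    for signs in itertools.product((-1, 1), repeat=len(cols)):
        c = [sum(s * col[k] for s, col in zip(signs, cols)) for k in range(n)]
        powers.append(inner(c, c))
    mean = sum(powers) / len(powers)
    var = sum((p - mean) ** 2 for p in powers) / len(powers)
    return mean, var

def predicted_moments(cols):
    """Mean sum_i |X_i|^2 and variance 2 sum_{i != i'} (X_i . X_i')^2."""
    L = len(cols)
    mean = sum(inner(c, c) for c in cols)
    var = 2 * sum(inner(cols[i], cols[k]) ** 2
                  for i in range(L) for k in range(L) if i != k)
    return mean, var

rng = random.Random(0)
# One selected column per section: L = 5 columns of length n = 4.
cols = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(5)]
print(enumerated_moments(cols))
print(predicted_moments(cols))
```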

Accordingly, for each subset, most choices of sign produce
a codeword with power near $P$. Moreover, for this
codeword power property, it is enough that the individual
columns of the dictionary have $|X_j|^2$ near $P/L$ and $X_j \cdot X_{j'}$
near 0, uniformly over $j \ne j'$.